Re: [PBML] Extracting from <body> to </body>

  • Gordon Stewart
    Message 1 of 8 , May 1, 2002
      At 07:59 1/05/02 -0500, you wrote:
      >Is there a module out there that can extract the content that is between
      >the opening body statement (<body>) and the closing body statement (</body>)?

      try (untested) :-

      $html = (html code)

      $html =~ m/<body>(.*)<\/body>/i;

      print $1;

      the $1 contains everything between the two BODY tags...

      or if that doesn't work, change

      $html =~ m/<body>(.*)<\/body>/i;


      to

      $html =~ m/\<body\>(.*)\<\/body\>/i;


      PLEASE - Can someone tell me - Do the < and > need to be 'escaped'?
      I've never figured that out - or asked before...

      Is there anything wrong with 'escaping' anything that doesn't need escaping?

      G.
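
On the escaping question: in a Perl regex, `<` and `>` are not metacharacters on their own (they only take on meaning inside constructs such as `(?<=...)`), and a backslash in front of any non-alphanumeric character simply makes it literal. So escaping them should be harmless. Escaping letters or digits is different, since `\d`, `\b` and friends have their own meanings. A small sketch with made-up sample data:

```perl
use strict;
use warnings;

# Made-up sample input, for illustration only.
my $html = '<BODY>hello</BODY>';

# Unescaped and escaped angle brackets match the same text.
my ($plain)   = $html =~ m/<body>(.*)<\/body>/i;
my ($escaped) = $html =~ m/\<body\>(.*)\<\/body\>/i;
print "plain:   $plain\n";
print "escaped: $escaped\n";

# Picking a different delimiter avoids having to escape the slash.
my ($braces) = $html =~ m{<body>(.*)</body>}i;
print "braces:  $braces\n";

# quotemeta (or \Q...\E inside a pattern) backslashes every
# non-word character, so literal text cannot be misread as regex syntax.
my $quoted = quotemeta('</body>');
print "quoted:  $quoted\n";
```

All three matches capture `hello`, and `quotemeta` turns `</body>` into `\<\/body\>`.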
    • Jeff 'japhy' Pinyan
      Message 2 of 8 , May 1, 2002
        On May 1, Greg Krieser said:

        >Is there a module out there that can extract the content that is between
        >the opening body statement (<body>) and the closing body statement
        >(</body>)?

        You might want to use HTML::Parser, or HTML::TokeParser. It won't be very
        difficult.

        --
        Jeff "japhy" Pinyan japhy@... http://www.pobox.com/~japhy/
        RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
        ** Look for "Regular Expressions in Perl" published by Manning, in 2002 **
        <stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
        [ I'm looking for programming work. If you like my work, let me know. ]
      • Greg Krieser
        Message 3 of 8 , May 1, 2002
          Gordon,

          Thanks for the code. Tried both, but did not get them to work. I'll read more about match and see if I can figure out why it didn't. I'll let you know what I find.

          Thanks,

          Greg

          The following message was sent by Gordon Stewart <gordon52@...> on Thu, 02 May 2002 01:56:22 +1200.

          > At 07:59 1/05/02 -0500, you wrote:
          > >Is there a module out there that can extract the content that is between
          > >the opening body statement (<body>) and the closing body statement (</body>)?
          >
          > try (untested) :-
          >
          > $html = (html code)
          >
          > $html =~ m/<body>(.*)<\/body>/i;
          >
          > print $1;
          >
          > the $1 contains everything between the two BODY tags...
          >
          > or if that doesn't work, change
          >
          > $html =~ m/<body>(.*)<\/body>/i;
          >
          > to
          >
          > $html =~ m/\<body\>(.*)\<\/body\>/i;
          >
          > PLEASE - Can someone tell me - Do the < and > need to be 'escaped'?
          > I've never figured that out - or asked before...
          >
          > Is there anything wrong with 'escaping' anything that doesn't need escaping?
          >
          > G.
        • Jeff 'japhy' Pinyan
          Message 4 of 8 , May 1, 2002
            On May 1, Greg Krieser said:

            >Thanks for the code. Tried both, but did not get them to work. I'll
            >read more about match and see if I can figure out why it didn't. I'll
            >let you know what I find.

            His regex fails because . doesn't match \n, and I'm SURE your HTML text
            has newlines in it. If you wanted to be quick about it (and possibly
            inaccurate) you could use

            ($contents) = $HTML =~ m{<body>(.*)</body>}si;
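
A small sketch of the difference the /s modifier makes, using a made-up two-line document. The last pattern is an assumption beyond what was posted: it allows attributes on the opening tag and uses a non-greedy match, and it is still "quick and possibly inaccurate" (an attribute value containing `>` would confuse it, for instance).

```perl
use strict;
use warnings;

# Made-up sample input with newlines inside the body.
my $html = "<html><body>\nline one\nline two\n</body></html>\n";

# Without /s the dot will not cross a newline, so the match fails
# and the list assignment leaves the variable undefined.
my ($without_s) = $html =~ m{<body>(.*)</body>}i;
print defined $without_s ? "matched without /s\n" : "no match without /s\n";

# With /s the dot matches newlines too.
my ($with_s) = $html =~ m{<body>(.*)</body>}si;
print "with /s: [$with_s]\n";

# Assumption: tolerate attributes on the tag and stop at the
# first closing tag instead of the last one.
my ($loose) = '<BODY bgcolor="#ffffff">text</BODY>'
    =~ m{<body\b[^>]*>(.*?)</body>}si;
print "loose: [$loose]\n";
```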

          • Greg Krieser
            Message 5 of 8 , May 1, 2002
              Thanks for everyone's help. I've got a sample working, but not the way I'd like. Could I get a little more help?

              Tried this parser sample code I found at: http://www.gellyfish.com/htexamples.

              #!/usr/bin/perl -w
              package Example;
              use strict;
              require HTML::Parser;
              @Example::ISA = qw(HTML::Parser);

              my $parser = Example->new;
              $parser->parse_file('test.html');
              print $parser->{TEXT};

              # Called by HTML::Parser for each chunk of text content (not for tags).
              sub text
              {
                  my ($self,$text) = @_;
                  $self->{TEXT} .= $text;
              }

              This produces some javascript that is in the test.html file, but not everything between <body> and </body>. How can I modify the code to specify this requirement?

              Thanks A Lot,

              Greg
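
One likely reason the code above prints only JavaScript and other loose text: the `text` method receives text content only (which includes the contents of `<script>` elements), while tags are reported through separate start and end events. Below is a minimal sketch of one way to keep everything between the body tags, assuming HTML::Parser 3.x from CPAN is installed. The `extract_body` helper and the sample HTML are invented for illustration and are not from the gellyfish examples.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the raw markup between <body ...> and </body>.
sub extract_body {
    my ($html) = @_;
    my $in_body = 0;
    my $body    = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: switch collection on at <body>, otherwise keep the tag.
        start_h => [ sub {
            my ($tagname, $text) = @_;
            if ($tagname eq 'body') { $in_body = 1; return; }
            $body .= $text if $in_body;
        }, 'tagname,text' ],
        # End tags: switch collection off at </body>, otherwise keep the tag.
        end_h => [ sub {
            my ($tagname, $text) = @_;
            if ($tagname eq 'body') { $in_body = 0; return; }
            $body .= $text if $in_body;
        }, 'tagname,text' ],
        # Everything else (text, comments, declarations) is copied as-is.
        default_h => [ sub {
            my ($text) = @_;
            $body .= $text if $in_body && defined $text;
        }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $body;
}

# Made-up sample input.
my $sample = qq{<html><head><title>t</title></head>}
           . qq{<BODY bgcolor="#ffffff">\n<p>Hi <b>there</b></p>\n</BODY></html>};
print extract_body($sample);
```

Because the parser lowercases tag names and handles attributes itself, this should cope with `<BODY bgcolor=...>` without any extra pattern work.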
            • daymobrew
              Message 6 of 8 , May 2, 2002
                --- In perl-beginner@y..., "Greg Krieser" <greg@k...> wrote:
                > Thanks for everyone's help. I've got a sample working, but not the
                > way I'd like. Could I get a little more help?
                >
                > Tried this parser sample code I found at:
                > http://www.gellyfish.com/htexamples.
                >
                > #!/usr/bin/perl -w
                > package Example;
                > use strict;
                > require HTML::Parser;
                > @Example::ISA = qw(HTML::Parser);
                > my $parser = Example->new;
                > $parser->parse_file('test.html');
                > print $parser->{TEXT};
                > sub text
                > {
                > my ($self,$text) = @_;
                > $self->{TEXT} .= $text;
                > }
                >  
                > This produces some javascript that is in the test.html file, but
                > not everything between <body> and </body>. How can I modify the code
                > to specify this requirement?
                >
                > Thanks A Lot,
                >
                > Greg

                I modified Jeff's (working) regexp. I got the modified regexp
                working. Here is the full code:

                #!/usr/local/bin/perl -w

                use strict;

                if ( open( FH, 'body.html' ) )
                {
                    my $whole_file = join( '', <FH> );
                    close( FH );

                    $whole_file =~ s@.*<body>(.*)</body>.*@$1@si;
                    print "$whole_file";
                }
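
One thing to watch with the substitution above: if the pattern does not match (for example because the tag is written `<body bgcolor="...">`), `s///` leaves the string untouched and the whole file is printed. A hedged variation that captures instead of substituting, and reports when nothing was found, might look like this. The `body_of_file` name and the attribute-tolerant pattern are assumptions, not part of the posted code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the body tags of a file,
# or undef when no body element is found.
sub body_of_file {
    my ($path) = @_;

    # Three-argument open with a lexical filehandle; die on failure.
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    my $whole_file = do { local $/; <$fh> };    # read the file in one go
    close($fh);

    # Assumption: allow attributes on the opening tag.
    my ($body) = $whole_file =~ m{<body\b[^>]*>(.*)</body>}si;
    return $body;
}

# Use the first command-line argument, or body.html as in the post above.
my $path = @ARGV ? $ARGV[0] : 'body.html';
if ( -e $path ) {
    my $body = body_of_file($path);
    print defined $body ? $body : "no <body> element found in $path\n";
}
```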
              • Greg Krieser
                Message 7 of 8 , May 2, 2002
                  VERY IMPRESSIVE! Works like a champ! Thanks!

                  This list is showing me the importance of regexps. Thanks for the help. Can't wait to implement this everywhere.

                  Thanks!

                  The following message was sent by "daymobrew" <daymobrew@...> on Thu, 02 May 2002 13:30:53 -0000.

                  > --- In perl-beginner@y..., "Greg Krieser" <greg@k...> wrote:
                  > > Thanks for everyone's help. I've got a sample working, but not the
                  > > way I'd like. Could I get a little more help?
                  > >
                  > > Tried this parser sample code I found at:
                  > > http://www.gellyfish.com/htexamples.
                  > >
                  > > #!/usr/bin/perl -w
                  > > package Example;
                  > > use strict;
                  > > require HTML::Parser;
                  > > @Example::ISA = qw(HTML::Parser);
                  > > my $parser = Example->new;
                  > > $parser->parse_file('test.html');
                  > > print $parser->{TEXT};
                  > > sub text
                  > > {
                  > > my ($self,$text) = @_;
                  > > $self->{TEXT} .= $text;
                  > > }
                  > >
                  > > This produces some javascript that is in the test.html file, but
                  > > not everything between <body> and </body>. How can I modify the code
                  > > to specify this requirement?
                  > >
                  > > Thanks A Lot,
                  > >
                  > > Greg
                  >
                  > I modified Jeff's (working) regexp. I got the modified regexp
                  > working. Here is the full code:
                  >
                  > #!/usr/local/bin/perl -w
                  >
                  > use strict;
                  >
                  > if ( open( FH, 'body.html' ) )
                  > {
                  >     my $whole_file = join( '', <FH> );
                  >     close( FH );
                  >
                  >     $whole_file =~ s@.*<body>(.*)</body>.*@$1@si;
                  >     print "$whole_file";
                  > }