Re: [PBML] Extracting from <body> to </body>

  • Jeff 'japhy' Pinyan
    Message 1 of 8 , May 1, 2002
      On May 1, Greg Krieser said:

      >Is there a module out there that can extract the content that is between
      >the opening body statement (<body>) and the closing body statement
      >(</body>)?

      You might want to use HTML::Parser, or HTML::TokeParser. It won't be very
      difficult.

      --
      Jeff "japhy" Pinyan japhy@... http://www.pobox.com/~japhy/
      RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
      ** Look for "Regular Expressions in Perl" published by Manning, in 2002 **
      <stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
      [ I'm looking for programming work. If you like my work, let me know. ]
    • Greg Krieser
      Message 2 of 8 , May 1, 2002
        Gordon,

        Thanks for the code. Tried both, but did not get them to work. I'll read more about match and see if I can figure out why it didn't. I'll let you know what I find.

        Thanks,

        Greg

        The following message was sent by Gordon Stewart <gordon52@...> on Thu, 02 May 2002 01:56:22 +1200.

        > At 07:59 1/05/02 -0500, you wrote:
        > >Is there a module out there that can extract the content that is between
        > >the opening body statement (<body>) and the closing body statement
        > >(</body>)?
        >
        > try (untested):
        >
        > $html = (html code)
        >
        > $html =~ m/<body>(.*)<\/body>/i;
        >
        > print $1;
        >
        > the $1 contains everything between the two BODY tags...
        >
        > or if that doesn't work, change
        >
        > $html =~ m/<body>(.*)<\/body>/i;
        >
        > to
        >
        > $html =~ m/\<body\>(.*)\<\/body\>/i;
        >
        > PLEASE - Can someone tell me - Do the < and > need to be 'escaped'?
        > I've never figured that out - or asked before...
        >
        > Is there anything wrong with 'escaping' anything that doesn't need escaping?
        >
        > G.
      • Jeff 'japhy' Pinyan
        Message 3 of 8 , May 1, 2002
          On May 1, Greg Krieser said:

          >Thanks for the code. Tried both, but did not get them to work. I'll
          >read more about match and see if I can figure out why it didn't. I'll
          >let you know what I find.

          His regex fails because . doesn't match \n, and I'm SURE your HTML text
          has newlines in it. If you wanted to be quick about it (and possibly
          inaccurate) you could use

          ($contents) = $HTML =~ m{<body>(.*)</body>}si;
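
A minimal sketch of the difference the /s modifier makes, and of the escaping question raised in the quoted message earlier in the thread. The sample string is made up for illustration and is not from any poster's file.

```perl
use strict;
use warnings;

# Made-up sample; note the newlines inside the body.
my $sample = "<html><BODY>\nline one\nline two\n</BODY></html>\n";

# Without /s, . stops at a newline, so this match fails on multi-line input.
my ($without_s) = $sample =~ m{<body>(.*)</body>}i;

# With /s, . also matches newlines, so the capture spans several lines.
my ($with_s) = $sample =~ m{<body>(.*)</body>}si;

# < and > are not regex metacharacters; escaping them changes nothing.
my ($escaped) = $sample =~ m/\<body\>(.*)\<\/body\>/si;

print "without /s: ", ( defined $without_s ? "matched" : "no match" ), "\n";
print "with /s:    ", ( defined $with_s    ? "matched" : "no match" ), "\n";
print "escaped form gives the same capture\n" if $escaped eq $with_s;
```

Escaping punctuation that does not need it is harmless in a Perl pattern. Escaping letters or digits is another matter, since sequences such as \b or \d have meanings of their own.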

        • Greg Krieser
          Message 4 of 8 , May 1, 2002
            Thanks for everyone's help. I've got a sample working, but not the way I'd like. Could I get a little more help?

            Tried this parser sample code I found at: http://www.gellyfish.com/htexamples.

            #!/usr/bin/perl -w
            package Example;
            use strict;
            require HTML::Parser;
            @Example::ISA = qw(HTML::Parser);

            my $parser = Example->new;
            $parser->parse_file('test.html');
            print $parser->{TEXT};

            # Called by the parser for each run of plain text (never for tags).
            sub text
            {
                my ($self,$text) = @_;
                $self->{TEXT} .= $text;
            }

            This produces some javascript that is in the test.html file, but not everything between <body> and </body>. How can I modify the code to specify this requirement?

            Thanks A Lot,

            Greg
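
One possible way to get the markup as well as the text, sketched here under the assumption that the HTML::Parser version 3 handler interface is available: register handlers for start tags, end tags and everything else, and copy the original source text only while inside the body element. The name body_markup and the sample string are hypothetical, chosen for this illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the raw markup found between <body ...> and </body>.
# body_markup is a hypothetical helper name, not part of HTML::Parser.
sub body_markup {
    my ($html) = @_;
    my $inside = 0;     # true while the parser is between the body tags
    my $body   = '';

    # Shared handler for start and end tags: toggle on body, copy the rest.
    my $tag_handler = sub {
        my ( $event, $tagname, $text ) = @_;
        if ( $tagname eq 'body' ) {
            $inside = ( $event eq 'start' ) ? 1 : 0;
            return;
        }
        $body .= $text if $inside;
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_handler, 'event,tagname,text' ],
        end_h       => [ $tag_handler, 'event,tagname,text' ],
        # Text, comments, declarations and so on all arrive here.
        default_h   => [
            sub {
                my ($text) = @_;
                $body .= $text if $inside && defined $text;
            },
            'text'
        ],
    );

    $parser->parse($html);
    $parser->eof;       # flush anything the parser is still holding back
    return $body;
}

# Made-up sample document.
my $sample = qq{<html><head><title>t</title></head>\n}
           . qq{<body bgcolor="white">\n<p>Hello <b>world</b></p>\n</body></html>\n};
print body_markup($sample);
```

Unlike a regex, this keeps working when the body tag carries attributes, because the parser reports the tag name separately from its source text.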
          • daymobrew
            Message 5 of 8 , May 2, 2002
              --- In perl-beginner@y..., "Greg Krieser" <greg@k...> wrote:
              > Thanks for everyone's help. I've got a sample working, but not the
              > way I'd like. Could I get a little more help?
              >
              > Tried this parser sample code I found at:
              > http://www.gellyfish.com/htexamples.
              >
              > #!/usr/bin/perl -w
              > package Example;
              > use strict;
              > require HTML::Parser;
              > @Example::ISA = qw(HTML::Parser);
              > my $parser = Example->new;
              > $parser->parse_file('test.html');
              > print $parser->{TEXT};
              > sub text
              > {
              > my ($self,$text) = @_;
              > $self->{TEXT} .= $text;
              > }
              >
              > This produces some javascript that is in the test.html file, but
              > not everything between <body> and </body>. How can I modify the code
              > to specify this requirement?
              >
              > Thanks A Lot,
              >
              > Greg

              I modified Jeff's (working) regexp and got the modified version
              working. Here is the full code:

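              #!/usr/local/bin/perl -w

              use strict;

              # Read the whole file into one string, then keep only what sits
              # between <body> and </body>.
              if ( open( FH, 'body.html' ) )
              {
                  my $whole_file = join( '', <FH> );
                  close( FH );

                  # /s lets . match newlines; /i ignores the case of the tags.
                  $whole_file =~ s@.*<body>(.*)</body>.*@$1@si;
                  print "$whole_file";
              }
              else
              {
                  warn "Cannot open body.html: $!\n";
              }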
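
A variation on the same idea, offered as a sketch and not a drop-in replacement: it assumes the opening tag may carry attributes (for example a bgcolor), uses a match instead of a substitution so that a missing body can be detected, and stops at the first closing tag. The name extract_body and the sample string are made up for this illustration.

```perl
use strict;
use warnings;

# Return the text between the first <body ...> and the following </body>,
# or undef when no such pair is found.
sub extract_body {
    my ($html) = @_;
    # [^>]* allows attributes on the tag; .*? stops at the first </body>.
    my ($body) = $html =~ m{<body\b[^>]*>(.*?)</body\s*>}si;
    return $body;
}

# Made-up sample with an attribute on the body tag.
my $sample = qq{<html><BODY bgcolor="#ffffff">\n<p>Hello</p>\n</BODY></html>\n};
my $body   = extract_body($sample);
print defined $body ? $body : "no <body> element found\n";
```

This is still an approximation. A pattern like this can be fooled by a > inside an attribute value or by the text </body> inside a comment or script, which is where a real parser such as HTML::Parser earns its keep.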
            • Greg Krieser
              Message 6 of 8 , May 2, 2002
                VERY IMPRESSIVE! Works like a champ! Thanks!

                This list is showing me the importance of regexps. Thanks for the help. Can't wait to implement this everywhere.

                Thanks!
