
Extracting from <body> to </body>

  • Greg Krieser
    Message 1 of 8, May 1, 2002
      Is there a module out there that can extract the content that is between the opening body statement (<body>) and the closing body statement (</body>)?


      Thanks,

      Greg
    • Gordon Stewart
      Message 2 of 8, May 1, 2002
        At 07:59 1/05/02 -0500, you wrote:
        >Is there a module out there that can extract the content that is between
        >the opening body statement (<body>) and the closing body statement (</body>)?

        try (untested) :-

        $html = (html code)

        $html =~ m/<body>(.*)<\/body>/i;

        print $1;

        the $1 contains everything between the two BODY tags...

        or if that doesn't work, change

        $html =~ m/<body>(.*)<\/body>/i;


        to

        $html =~ m/\<body\>(.*)\<\/body\>/i;


        PLEASE - Can someone tell me - Do the < and > need to be 'escaped'?
        I've never figured that out - or asked before...

        Is there anything wrong with 'escaping' anything that doesn't need escaping?

        G.
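On the escaping question: as far as Perl's regex syntax goes, < and > are not metacharacters, so the two patterns above should behave identically. A backslash in front of punctuation is harmless. In front of a letter or digit it is not, since sequences such as \b, \d or \1 have their own meaning. A minimal sketch (the sample string and the sub names body_plain and body_escaped are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical single-line sample, not taken from the thread.
my $html = '<html><body>Hello</body></html>';

# Unescaped angle brackets: < and > are ordinary characters in a Perl regex.
sub body_plain {
    my ($text) = @_;
    return $text =~ m/<body>(.*)<\/body>/i ? $1 : undef;
}

# Escaped angle brackets: a backslash before punctuation just means
# "this literal character", so this pattern matches the same strings.
sub body_escaped {
    my ($text) = @_;
    return $text =~ m/\<body\>(.*)\<\/body\>/i ? $1 : undef;
}

print body_plain($html),   "\n";
print body_escaped($html), "\n";
```

Both calls print Hello for this input. Note that the sample has no newlines in it, which matters for the rest of the thread.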
      • Jeff 'japhy' Pinyan
        Message 3 of 8, May 1, 2002
          On May 1, Greg Krieser said:

          >Is there a module out there that can extract the content that is between
          >the opening body statement (<body>) and the closing body statement
          >(</body>)?

          You might want to use HTML::Parser, or HTML::TokeParser. It won't be very
          difficult.

          --
          Jeff "japhy" Pinyan japhy@... http://www.pobox.com/~japhy/
          RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
          ** Look for "Regular Expressions in Perl" published by Manning, in 2002 **
          <stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
          [ I'm looking for programming work. If you like my work, let me know. ]
        • Greg Krieser
          Message 4 of 8, May 1, 2002
            Gordon,

            Thanks for the code. Tried both, but did not get them to work. I'll read more about match and see if I can figure out why it didn't. I'll let you know what I find.

            Thanks,

            Greg

            In reply to Gordon Stewart <gordon52@...>, Thu, 02 May 2002 01:56:22 +1200.
          • Jeff 'japhy' Pinyan
            Message 5 of 8, May 1, 2002
              On May 1, Greg Krieser said:

              >Thanks for the code. Tried both, but did not get them to work. I'll
              >read more about match and see if I can figure out why it didn't. I'll
              >let you know what I find.

              His regex fails because . doesn't match \n, and I'm SURE your HTML text
              has newlines in it. If you wanted to be quick about it (and possibly
              inaccurate) you could use

              ($contents) = $HTML =~ m{<body>(.*)</body>}si;
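To see the effect of the /s modifier in isolation, here is a small sketch. The sample text and the sub names are hypothetical. The third sub is a variation that is not from the thread: it assumes you also want to accept attributes on the body tag and stop at the first closing tag.

```perl
use strict;
use warnings;

# Hypothetical sample with newlines inside the body.
my $html = "<html>\n<body>\nline one\nline two\n</body>\n</html>\n";

# Without /s, "." never matches "\n", so (.*) cannot span lines.
sub body_without_s {
    my ($text) = @_;
    my ($contents) = $text =~ m{<body>(.*)</body>}i;
    return $contents;
}

# With /s, "." matches any character, newlines included.
sub body_with_s {
    my ($text) = @_;
    my ($contents) = $text =~ m{<body>(.*)</body>}si;
    return $contents;
}

# Variation (an assumption, not from the thread): allow attributes such as
# <body bgcolor="white"> and use a non-greedy (.*?) to stop at the first </body>.
sub body_tolerant {
    my ($text) = @_;
    my ($contents) = $text =~ m{<body\b[^>]*>(.*?)</body\s*>}si;
    return $contents;
}

print defined(body_without_s($html)) ? "matched\n" : "no match\n";
print defined(body_with_s($html))    ? "matched\n" : "no match\n";
```

For this sample the first line printed is "no match" and the second is "matched". This is still a quick approach: comments or scripts that contain the text </body> would confuse it, which is where a real parser helps.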

            • Greg Krieser
              Message 6 of 8, May 1, 2002
                Thanks for everyone's help. I've got a sample working, but not the way I'd like. Could I get a little more help?

                Tried this parser sample code I found at: http://www.gellyfish.com/htexamples.

                #!/usr/bin/perl -w
                package Example;
                use strict;
                require HTML::Parser;
                @Example::ISA = qw(HTML::Parser);
                my $parser = Example->new;
                $parser->parse_file('test.html');
                print $parser->{TEXT};
                sub text
                {
                    my ($self,$text) = @_;
                    $self->{TEXT} .= $text;
                }

                This produces some javascript that is in the test.html file, but not everything between <body> and </body>. How can I modify the code to specify this requirement?
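One possible way to do this with HTML::Parser, sketched with the version 3 handler interface rather than the subclassing style above. The text method in the sample only ever sees text runs (script contents included), not the tags themselves, which is probably why the output looks incomplete. The sub name body_content and the $in_body flag are made up for illustration, and the sketch assumes HTML::Parser 3.x is installed:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Return the raw markup found between the <body ...> and </body> tags.
# body_content is a hypothetical helper name, not part of HTML::Parser.
sub body_content {
    my ($html) = @_;
    my $in_body = 0;     # true while the parser is inside the body element
    my $out     = '';

    # Shared handler for start and end tags: toggle the flag on body tags,
    # otherwise copy the original tag text while inside the body.
    my $tag_handler = sub {
        my ($event, $tagname, $text) = @_;
        if ($tagname eq 'body') {
            $in_body = ($event eq 'start') ? 1 : 0;
            return;
        }
        $out .= $text if $in_body;
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_handler, 'event, tagname, text' ],
        end_h       => [ $tag_handler, 'event, tagname, text' ],
        # Everything else (text, comments, declarations) is copied as-is.
        default_h   => [ sub { $out .= $_[0] if $in_body }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print body_content("<html><body><p>Hi <b>there</b></p></body></html>"), "\n";
```

To read from a file as in the sample, parse_file('test.html') could replace the parse and eof calls.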

                Thanks A Lot,

                Greg
              • daymobrew
                Message 7 of 8, May 2, 2002
                  --- In perl-beginner@y..., "Greg Krieser" <greg@k...> wrote:
                  > Thanks for everyone's help. I've got a sample working, but not the
                  > way I'd like. Could I get a little more help?
                  >
                  > Tried this parser sample code I found at:
                  > http://www.gellyfish.com/htexamples.
                  >
                  > #!/usr/bin/perl -w
                  > package Example;
                  > use strict;
                  > require HTML::Parser;
                  > @Example::ISA = qw(HTML::Parser);
                  > my $parser = Example->new;
                  > $parser->parse_file('test.html');
                  > print $parser->{TEXT};
                  > sub text
                  > {
                  > my ($self,$text) = @_;
                  > $self->{TEXT} .= $text;
                  > }
                  >  
                  > This produces some javascript that is in the test.html file, but
                  > not everything between <body> and </body>. How can I modify the code
                  > to specify this requirement?
                  >
                  > Thanks A Lot,
                  >
                  > Greg

                  I modified Jeff's (working) regexp and got the modified version
                  working. Here is the full code:

                  #!/usr/local/bin/perl -w

                  use strict;

                  if ( open( FH, 'body.html' ) )
                  {
                      my $whole_file = join( '', <FH> );
                      close( FH );

                      $whole_file =~ s@.*<body>(.*)</body>.*@$1@si;
                      print "$whole_file";
                  }
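One caveat with the substitution approach: if the file has no <body>...</body> pair, s/// leaves the string untouched and the whole file is printed. A variation that captures instead of substituting, so the no-match case can be detected, might look like this. The sub name extract_body_from_file is hypothetical, and the three-argument open and the slurp idiom are substitutions made for this sketch, not part of the original post:

```perl
use strict;
use warnings;

# Return the text between <body> and </body> in the given file,
# or undef if no such pair is found. (Hypothetical helper name.)
sub extract_body_from_file {
    my ($path) = @_;

    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    # Undefining $/ locally makes <$fh> return the whole file at once.
    my $whole_file = do { local $/; <$fh> };
    close($fh);

    # Same pattern as in the thread, used as a capturing match.
    my ($body) = $whole_file =~ m{<body>(.*)</body>}si;
    return $body;
}

# Usage: perl script.pl body.html
if (@ARGV) {
    my $body = extract_body_from_file( $ARGV[0] );
    print defined $body ? $body : "no <body>...</body> pair found\n";
}
```

Like the original, this only recognises a bare <body> tag with no attributes.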
                • Greg Krieser
                  Message 8 of 8, May 2, 2002
                    VERY IMPRESSIVE! Works like a champ! Thanks!

                    This list is showing me the importance of regexps. Thanks for the help. Can't wait to implement this everywhere.

                    Thanks!

                    In reply to "daymobrew" <daymobrew@...>, Thu, 02 May 2002 13:30:53 -0000.