Re: [PBML] Getting multi-line data from a text file. (long)

  • Charles K. Clarkson
    Message 1 of 1 , Jan 29, 2002
      "Tony Austin" <tony@...>

      : Sorry for a simple question.....

      This is a beginners group. We expect simple questions.
      Apologize for database design and multiple inheritance
      questions.

      : I am attempting to collect data from a text file (html page)...
      :
      : The data I am after is embedded within multiple lines of the file.
      :
      : For example, from the HTML file:
      :
      : <B>text line1</B>
      : <br>text line 2
      : <br>
      : text line 3
      : <br>
      : text line 4

      Because of the nature of HTML, it is usually easier
      to use an HTML parser than to write your own. A lot of
      beginners avoid modules. Unless you're trying to expand
      your own personal programming skills, modules are the
      best route. As it happens, many perl installations
      already have an HTML parser, and it is on CPAN for the
      rest. But first let's take a quick look at what you wrote.

      : There may be 2 to 50 occurrences of this common string
      : within the file. I can write a regex to pull the data I am
      : after but am having trouble dealing with the multiple
      : line format of the data.
      :
      : I have attempted

      This is your script with indentation:

      foreach $line (@file) {
          chomp($line);
          $nn++;
          if ($line =~ m/<b>([^<.*>]*)<\/b>/ig) {
              #found name
              $dat1 = $1;
              next;
              $_ =~ m/<br>([a-z0-9]*)/ig;
              $dat2 = $1;
              next;
              next;
              $_ =~ s/ / /ig;
              @parts = split(/,/, $_);
              $dat3 = $parts[0];
              $parts[1] =~ m/([^\s]*)[\s]*([^\s]*)$/i;
              $dat4 = $1;
              $dat5 = $2;
              next;
              next;
              $_ =~ s/ / /ig;
              $dat6 = $_;

              print LOG "$dat1|$dat2|$dat3|$dat4|$dat5|$dat6\n";
          }
      }

      The first 'next' jumps straight to the next pass through
      the foreach loop, so we end up with:

      foreach $line (@file) {
          chomp($line);
          $nn++;
          if ($line =~ m/<b>([^<.*>]*)<\/b>/ig) {
              #found name
              $dat1 = $1;
          }
      }

      The rest never executes, so the loop does nothing
      beyond setting $dat1.
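
To see that skipping effect in isolation, here is a minimal sketch. The sub name trace_loop and its labels are made up for illustration; the only point is that a statement placed after an unconditional 'next' is never reached.

```perl
use strict;
use warnings;

# Hypothetical helper: record which statements actually run in the loop.
sub trace_loop {
    my @items = @_;
    my @ran;
    foreach my $item (@items) {
        push @ran, "before:$item";
        next;                        # go straight to the next element
        push @ran, "after:$item";    # never reached
    }
    return @ran;
}

print join( ", ", trace_loop( "a", "b" ) ), "\n";   # prints: before:a, before:b
```

Each 'next' only moves the loop on to the following element; it does not fetch another line into $_ and carry on from where it was.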

      It would have been helpful if you had included a few
      lines of your data file, but we'll try winging it. Anything
      located after the __END__ or __DATA__ token in a perl
      script can be read through a special filehandle
      called DATA.
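
As a rough sketch of the DATA filehandle on its own, assuming nothing beyond core perl: the sub name read_lines is invented for this example, and the two lines after __DATA__ are sample input only, not your real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line from a filehandle, minus newlines.
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# \*DATA passes the special DATA filehandle like any other filehandle.
my @lines = read_lines( \*DATA );
print "Read ", scalar(@lines), " lines\n";
print "$_\n" for @lines;

__DATA__
<B>text line 1</B>
<br>text line 2
```

This keeps the sample input in the same file as the code, which is handy for testing before you point the script at a real file.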

      We can outline the script as:

      use strict;
      use warnings;


      __END__
      <B>text
      line 1<br>
      text line 2
      </B>
      <br>
      text line 3
      <br><B>text line 4</B>
      text line 5


      I changed this up a bit to allow for some of the
      cases where a straightforward regex might fail.
      Without your actual data file I'm only guessing.

      I note from your snippet that you are not using
      strict. I always use strict with warnings turned on.
      That's why you will see a lot of 'my's in this example.

      I have chosen HTML::TokeParser because many perl
      installations already have it (it is on CPAN
      otherwise) and it has great examples in its
      documentation.

      Here's the guts of my first attempt with
      HTML::TokeParser:


      use HTML::TokeParser;

      my $p = HTML::TokeParser->new(\*DATA);

      while ( $p->get_tag ) {
          print $p->get_trimmed_text, "\n";
      }

      which printed:

      text line 1
      text line 2

      text line 3

      text line 4
      text line 5

      Not exactly what we wanted but damn close.

      Before we fix it let's look at what's happening
      line by line.

      use HTML::TokeParser;

      This tells perl to load the HTML::TokeParser module
      (the file HTML/TokeParser.pm) so we can use what's
      inside it.


      my $p = HTML::TokeParser->new(\*DATA);

      This is more complicated and will require us to
      read the documentation for full details and options.
      We are creating an object here and putting a
      reference to that object into the scalar variable '$p'.
      Further, we are telling HTML::TokeParser to read the
      HTML it will parse from the DATA filehandle.

      We can now access certain methods in the $p
      object by prefixing the method with '$p->'. One
      method of the object in $p is the get_tag
      method. We refer to this with:

      $p->get_tag

      This method returns a reference to an array describing
      the next tag, or undef when it gets to the end of the data.

      while ( $p->get_tag ) {

      keeps getting the next tag until it reaches the end of
      the data. At that point get_tag returns undef and the
      while block exits.
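
To peek at what get_tag hands back, here is a small sketch. It assumes HTML::TokeParser is installed; the sub name tag_names and the sample HTML string are made up for illustration. The first element of the returned array is the tag name, lowercased, with a leading '/' for end tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: list the tag names get_tag returns for a snippet.
sub tag_names {
    my ($html) = @_;
    # A reference to a plain scalar is parsed as the document itself.
    my $p = HTML::TokeParser->new( \$html )
        or die "Can't create parser";
    my @names;
    while ( my $tag = $p->get_tag ) {
        push @names, $tag->[0];    # first element is the tag name
    }
    return @names;
}

print join( " ", tag_names("<B>text line 1</B><br>text line 2") ), "\n";
# prints: b /b br
```

Passing a tag name, as in $p->get_tag("b"), skips ahead to the next tag of that kind, which is one way to pick out only the pieces you care about.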


      print $p->get_trimmed_text, "\n";

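      Prints the return value of the get_trimmed_text
      method from the object in $p: the text up to the
      next tag, with leading and trailing whitespace removed
      and runs of whitespace collapsed to single spaces.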

      You can read more about perl references and
      objects in perlref, perltoot, and perlobj among
      other files listed in the standard perl distribution.

      ___________________

      So how do we get rid of those blank lines?
      Well the documentation for HTML::TokeParser
      gets us fixed up pretty quick:

      while ( $p->get_tag ) {
          my $text = $p->get_trimmed_text;
          print "$text\n" if $text;
      }
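
A side note on that 'if $text' test, as a minimal sketch in core perl: the empty string is false, which is what drops the blank lines, but the string "0" is false too. If your data could ever contain a bare 0, testing the length is safer. The sub names truthy_only and nonempty_only and the sample list are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the strings that 'if $text' would print.
sub truthy_only   { return grep { $_ } @_ }

# Hypothetical helper: keep every string that is not empty, "0" included.
sub nonempty_only { return grep { length } @_ }

my @texts = ( "text line 1", "", "0", "text line 3" );

print join( "|", truthy_only(@texts) ),   "\n";   # prints: text line 1|text line 3
print join( "|", nonempty_only(@texts) ), "\n";   # prints: text line 1|0|text line 3
```

In the loop above that would mean writing: print "$text\n" if length $text;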


      I'll let you read the docs to find out how this
      works as it seems my post has become typically
      long. If you need more help just ask and please
      stop apologizing. Many of us live to answer
      questions just like yours.



      HTH,
      Charles K. Clarkson
      Clarkson Energy Homes, Inc.
      254 968-8328


      You can lead a man to logic, but you can't make him think.


      As tested:

      #! /usr/bin/perl
      use strict;
      use warnings;

      use HTML::TokeParser;

      my $p = HTML::TokeParser->new(\*DATA);

      while ( $p->get_tag ) {
          my $text = $p->get_trimmed_text;
          print "$text\n" if $text;
      }


      __END__
      <B>text
      line 1<br>
      text line 2
      </B
      >
      <br>
      text line 3
      <br><B>text line 4</B>
      text line 5