
html file to cvs

  • Nikolaos A. Patsopoulos
    Message 1 of 3 , Sep 6, 2006
      Hi all,

      I have the following problem:

      I have a huge pack of html files (>1000) and I want to extract some info
      on cvs files. The html source looks like this:


      .....
      <b>Source:</b> 338 (13): 853-860 MAR 26 1998
      <b>Addresses:</b>
      <a href="http:....."> Northwestern Univ,</a>

      The above block is repeated <=20 times.

      I want a cvs file that will look like this:

      1998;Northwestern Univ;
      1998;ETc;
      ....



      I tried a few things but could not get working code. One of the main
      issues is how to discriminate years from 4-digit page numbers, e.g. 1987.

      I tried to attach an html source file (since it is impossible to cut &
      paste the whole code here) but I got a failure notice.

      I have published the html file at the following address: http://users.uoi.gr/npatsop/portal_002.htm

      Thanks in advance,

      Nikos

      --
      Nikolaos A. Patsopoulos, MD
      Department of Hygiene and Epidemiology
      University of Ioannina School of Medicine
      University Campus
      Ioannina 45110
      Greece
      Tel: (+30) 26510-97804
      mobile: +30 6972882016
      Fax: (+30) 26510-97867 (care of Nikolaos A. Patsopoulos)
      e-mail: npatsop@...
    • Devin Weaver
      Message 2 of 3 , Sep 6, 2006
        I don't fully understand what you mean by a cvs file: whether that
        refers to a Concurrent Versions System (CVS) file or a
        comma-separated values (CSV) file. Based on the sample output I'm
        assuming a CSV file using semicolons.

        I chose Perl as the Swiss Army knife of scripting and was able to whip
        up a parser in about fifteen minutes. Attached is what I came up with.
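
        On telling years apart from 4-digit page numbers: a minimal sketch,
        assuming the year always follows a month abbreviation (with an optional
        day) as in the sample "MAR 26 1998". The helper name extract_year and
        the optional-day handling are assumptions for illustration, not taken
        from the sample page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the year found on a "Source:" line, or undef
# when the line carries no "MONTH [DAY] YEAR" date.
sub extract_year {
    my ($line) = @_;
    my $months = qr/(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)/i;
    # Require a month abbreviation (and an optional 1-2 digit day) directly
    # before the four digits, so that a page range such as 1987-1990 is
    # never mistaken for a year.
    if ($line =~ /\b$months\s+(?:[0-9]{1,2}\s+)?([0-9]{4})\b/) {
        return $1;
    }
    return;
}

my $year = extract_year("<b>Source:</b> 338 (13): 853-860 MAR 26 1998");
print "$year\n";    # prints 1998
```

        A line whose only 4-digit numbers are pages, such as "12: 1987-1990",
        yields undef, so the caller can skip or flag that record.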

        I left the loading of multiple files to the student. I used mainly
        regular expressions, so in theory it could be ported to Vim script, but
        this type of parsing is better suited to a scripting language than
        to an editor.
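
        For the multiple-file part, here is a rough sketch of one way it could
        look, assuming every file follows the same Source:/Addresses: layout
        as the sample. The names parse_file and convert_all and the output
        name out.csv are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse one HTML file and return "year;univ;" rows.
sub parse_file {
    my ($path) = @_;
    my @rows;
    my $year = "";
    open(my $in, "<", $path) or die "Could not open $path: $!";
    while (my $line = <$in>) {
        if ($line =~ /Source:/i) {
            # Year = four digits following "MONTH DAY".
            $year = $1
                if $line =~ /\b(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s+[0-9]+\s+([0-9]{4})\b/i;
        }
        elsif ($line =~ /Addresses:/i
            && $line =~ /<a(?:\s.+?)?>\s*(.+?)[\s,;]*<\/a>/i) {
            my $univ = $1;
            $univ =~ s/&amp;/&/g;    # decode the HTML entity
            push @rows, "$year;$univ;";
        }
    }
    close $in;
    return @rows;
}

# Hypothetical driver: append the rows of every input file to one output file.
sub convert_all {
    my ($output, @files) = @_;
    open(my $out, ">", $output) or die "Unable to open $output: $!";
    for my $file (@files) {
        print {$out} "$_\n" for parse_file($file);
    }
    close $out;
}

# Expand wildcards ourselves, since the Windows shell passes "*.htm" as-is.
convert_all("out.csv", map { glob($_) } @ARGV) if @ARGV;
```

        It would be run as "perl script.pl *.htm" from the directory holding
        the html files; all rows end up in a single out.csv.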

        Hope this gives some inspiration.

        On Sep 6, 2006, at 06:14, Nikolaos A. Patsopoulos wrote:
        > I have a huge pack of html files (>1000) and I want to extract some
        > info on cvs files.
      • Nikolaos A. Patsopoulos
        Message 3 of 3 , Sep 6, 2006
          Devin Weaver wrote:
          > I don't fully understand what you mean by a cvs file: whether that
          > refers to a Concurrent Versions System (CVS) file or a
          > comma-separated values (CSV) file. Based on the sample output I'm
          > assuming a CSV file using semicolons.
          >
          > I chose Perl as the Swiss Army knife of scripting and was able to whip
          > up a parser in about fifteen minutes. Attached is what I came up with.
          >
          > I left the loading of multiple files to the student. I used mainly
          > regular expressions, so in theory it could be ported to Vim script, but
          > this type of parsing is better suited to a scripting language than
          > to an editor.
          >
          > Hope this gives some inspiration.
          >
          > On Sep 6, 2006, at 06:14, Nikolaos A. Patsopoulos wrote:
          >> I have a huge pack of html files (>1000) and I want to extract some
          >> info on cvs files.
          >
          > ------------------------------------------------------------------------
          >
          > #!/usr/bin/perl
          > use strict;
          > use warnings;
          >
          > # Very simple script to parse a specifically styled HTML document and
          > # write the extracted fields to a delimited file.
          > #
          > # The following are the settings. Pick what you need. Using command line
          > # arguments is left to the student.
          >
          > my $file      = "portal_002.htm";
          > my $output    = "out.csv";
          > my $csv_delim = ';';
          > my $quiet     = 0;    # set this to 1 to stop debug output
          >
          > my $months_pat = "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";
          >
          > ######
          > # Print a debug message, followed by the line number if one is given.
          > sub msg
          > {
          >     my $str     = shift;
          >     my $line_no = shift;
          >
          >     if (!$quiet)
          >     {
          >         print $str;
          >         if (defined $line_no && $line_no ne "")
          >         {
          >             print " (line: $line_no)";
          >         }
          >         print "\n";
          >     }
          > }
          >
          > my $line_no = 0;     # used to track the line number.
          > my $year    = "";    # year from the most recent 'Source:' line
          > # Use 'or' rather than '||' so that die actually fires when open fails.
          > open(FD, "<", $file) or die "Could not open file $file: $!";
          > open(OUT, ">", $output) or die "Unable to open output file $output: $!";
          > while (my $line = <FD>)
          > {
          >     $line_no++;
          >     if ($line =~ /Source:/i)
          >     {
          >         # The year is the number after "MONTH DAY", e.g. "MAR 26 1998".
          >         if ($line =~ /$months_pat\s+[0-9]+\s+([0-9]+)/i)
          >         {
          >             $year = $2;
          >             msg ("Found 'Source:'; Year = $year", $line_no);
          >         }
          >     }
          >     elsif ($line =~ /Addresses:/i)
          >     {
          >         if ($line =~ /<a(\s.+?)?>(.+?)<\/a>/i)
          >         {
          >             my $univ = $2;
          >             $univ =~ s/^\s+//;
          >             $univ =~ s/(\s+|[,;])$//;
          >             # decode the HTML entity &amp;
          >             $univ =~ s/&amp;/&/gi;
          >             msg (" Child Found 'Addresses:'; Univ = $univ", $line_no);
          >             # Since this should be the end of the record write to file.
          >             print OUT "$year$csv_delim$univ$csv_delim\n";
          >         }
          >     }
          > }
          > close OUT;
          > close FD;
          > msg ("Done. (Parsed $line_no lines) CSV output to $output", "");
          >
          >
          > ------------------------------------------------------------------------
          Thanks for the time and effort. I work on a WinXP machine and cannot
          brag about my Perl knowledge. From the little code I can understand, it
          seems that you are close to what I want to do, but much is missing. I'm
          sorry, but I'm unable to follow a Perl script.

          Thanks,


          Nikos