
Re: [PBML] remove tags from html pages (long)

  • Charles K. Clarkson
    Message 1 of 1 , Feb 3, 2001
      From: "Hans" <hansfong@...>
      > Perl newbie here, so excuse my ignorance.

      No excuse needed, this is a group for newbies. Welcome.

      >
      > I found the thread on removing the first 10 lines very interesting. I'm
      > trying to do something with multi-line <SCRIPT>..</SCRIPT> tags in html
      > pages. I wrote the following:
      >
      1 #!/usr/bin/perl -w
      2 #
      3 open(FH,"456-459.html");
      4 @file=<FH>;
      5 close(FH);
      6
      7 foreach $_ (@file)
      8 {'s/<SCRIPT.*SCRIPT>//m'};
      9
      10 open(FH,">456-459.html");
      11 foreach $_ (@file)
      12 {print FH $_;}
      13 close(FH);

      > Line 7 gives me trouble, but I can't figure out
      > why. What do I do wrong?

      2 things:
      7 foreach $_ (@file)
      8 {'s/<SCRIPT.*SCRIPT>//m'};

      While it isn't breaking anything, there's no need for the $_.
      Just write:
      foreach (@file)
      Foreach loops don't require ';' endings. More importantly, leave
      off the single quotes. Inside quotes the substitution is just a
      string, so it never runs:
      {s/<SCRIPT.*SCRIPT>//m}
      You can also combine the loop onto one line:
      s/<SCRIPT.*SCRIPT>//m foreach @file;

      Lines 11 and 12 are done so often that perl has a shortcut.
      Instead of:
      11 foreach $_ (@file)
      12 {print FH $_;}
      You can simply write:
      print FH @file;
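
A minimal sketch of that shortcut, assuming perl 5.8 or later so that a filehandle can be opened on a scalar (this keeps the example from touching any real file). The sample lines and the names $loop_out and $list_out are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines read from the page.
my @file = ("first line\n", "second line\n");

# Version 1: print each element inside a loop.
my $loop_out = '';
open(my $fh, '>', \$loop_out) or die "Cannot open in-memory handle: $!";
foreach (@file) { print $fh $_; }
close($fh);

# Version 2: hand the whole array to print in one call.
my $list_out = '';
open($fh, '>', \$list_out) or die "Cannot open in-memory handle: $!";
print $fh @file;
close($fh);

print "both versions wrote the same text\n" if $loop_out eq $list_out;
```

Both versions write the same text as long as $, and $\ are left at their defaults.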

      Unfortunately, even with these changes you will only remove
      single-line SCRIPTs from your file because you're only
      processing 1 line at a time. The m modifier doesn't help here;
      it only changes what ^ and $ match.
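
To make the difference concrete, here is a small sketch that runs the substitution once per line and once over the whole text. The sample page is invented for illustration and is not taken from the original files.

```perl
use strict;
use warnings;

# Hypothetical sample page: one single-line script, one multi-line script.
my @lines = (
    "<P>keep</P>\n",
    "<SCRIPT>one_line();</SCRIPT>\n",
    "<SCRIPT>\n",
    "multi_line();\n",
    "</SCRIPT>\n",
);

# One line at a time: no single line holds both tags of the second script.
my @by_line = @lines;
s/<SCRIPT.*SCRIPT>//m foreach @by_line;
my $line_result = join '', @by_line;

# Whole page as one string, with /s so that '.' also matches newlines.
my $whole = join '', @lines;
$whole =~ s/<SCRIPT.*?SCRIPT>//gs;

print "line by line:\n$line_result\nwhole string:\n$whole";
```

The line-by-line result still contains the multi-line script; the whole-string result does not.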

      >
      > Second thing I want to ask: there are about 100 html pages I need to clean
      > up. In bash you use something like for i in *.html; do;....; done;. How
      > is this done in perl?
      >

      I recently showed someone how to use File::Find to do something
      like this. Unfortunately File::Find has poor documentation. It allows
      you to traverse a directory and its subdirectories pretty easily though:

      #!/usr/bin/perl -w
      use strict;
      use diagnostics;

      # plain quoted strings; forward slashes also work on Windows
      my $directory = 'c:/perl';
      my $write_file = 'out.txt';    # not used below

      use File::Find;
      # run the sub 'wanted' for each file in $directory.
      find(\&wanted, $directory);

      sub wanted {
          # $File::Find::name contains the complete pathname to the file
          # $_ contains just the current filename
          # skip this file unless it ends with .html or .htm
          # (anchored at the end, so 'page.html.bak' is skipped)
          return unless /\.html?$/;

          # you should probably create a backup file here. :)
          open (HTML, "<$File::Find::name")
              || die "Cannot open $File::Find::name $!";
          # See note 1 below
          undef $/;
          my $file = <HTML>;
          close HTML;
          # See note 2 below
          $file =~ s/<SCRIPT.*?SCRIPT>//igs;

          open (HTML, ">$File::Find::name")
              || die "Cannot open $File::Find::name $!";
          print HTML $file;
          close HTML;
      }

      Note 1:
      In a statement like: @file = <HTML>; perl splits the file into
      an array of lines using newlines (\n) or whatever is in the 'Input
      Record Separator' ($/). After 'undef $/;' a read slurps the entire
      file into one string, which we place in $file.
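
A short sketch of the two read modes, again using an in-memory filehandle (perl 5.8 or later) so nothing on disk is needed. The sample text is made up. The sketch uses 'local $/;' inside a block, which is a variation on the 'undef $/;' above: it undefines $/ only until the block ends.

```perl
use strict;
use warnings;

# Hypothetical file content.
my $content = "line one\nline two\nline three\n";

# Default: $/ is "\n", so list context yields one element per line.
open(my $fh, '<', \$content) or die "Cannot open in-memory handle: $!";
my @lines = <$fh>;
close($fh);

# With $/ undefined, a single read returns the whole file.
my $slurped;
{
    local $/;    # undefined inside this block, restored afterwards
    open($fh, '<', \$content) or die "Cannot open in-memory handle: $!";
    $slurped = <$fh>;
    close($fh);
}

printf "%d lines, slurped %d characters\n", scalar(@lines), length $slurped;
```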

      Note 2:
      Why igs?
      i is case-insensitive. 'Script' is the same as 'ScRiPT'.
      g is global - in case someone wrote more than 1 SCRIPT.
      s will, according to perlre:
      Treat string as single line. That is, change '.' to match any
      character whatsoever, even a newline, which normally it
      would not match.

      Why the '?' before 'SCRIPT>'?
      To stop greediness. Here's an example:

      <HTML>
      . . .
      <SCRIPT>1
      . . .
      </SCRIPT>1
      <P>This is some stuff we'd like to keep</P>
      <SCRIPT>2
      . . .
      </SCRIPT>2

      Without '?' perl would match from the first '<SCRIPT>'
      to the second '</SCRIPT>'.
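
The effect is easy to check with a sketch like the following. The sample string is a simplified, hypothetical version of the page above.

```perl
use strict;
use warnings;

# Hypothetical page with two scripts around a paragraph we want to keep.
my $html = "<SCRIPT>\nfirst();\n</SCRIPT>\n"
         . "<P>This is some stuff we'd like to keep</P>\n"
         . "<SCRIPT>\nsecond();\n</SCRIPT>\n";

# Greedy: '.*' runs on to the last 'SCRIPT>' in the string.
(my $greedy = $html) =~ s/<SCRIPT.*SCRIPT>//igs;

# Non-greedy: '.*?' stops at the first 'SCRIPT>' it can reach.
(my $lazy = $html) =~ s/<SCRIPT.*?SCRIPT>//igs;

print "greedy kept the paragraph: ", ($greedy =~ /<P>/ ? "yes" : "no"), "\n";
print "non-greedy kept the paragraph: ", ($lazy =~ /<P>/ ? "yes" : "no"), "\n";
```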

      We could add a little here so we don't rewrite files that don't
      change. It would speed up our program also. Replace the
      substitution line with:
      return unless ($file =~ s/<SCRIPT.*?SCRIPT>//igs);
      This will continue only if s/// matched something.
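
This works because s/// returns a true value (with g, the number of substitutions) when it changed something and a false value when it did not. A quick sketch with two made-up strings:

```perl
use strict;
use warnings;

# Hypothetical inputs: one page with scripts, one without.
my $with    = "<P>a</P><SCRIPT>x()</SCRIPT><SCRIPT>y()</SCRIPT>";
my $without = "<P>nothing to strip</P>";

# Capture the return value of each substitution.
my $n_with    = ($with    =~ s/<SCRIPT.*?SCRIPT>//igs);
my $n_without = ($without =~ s/<SCRIPT.*?SCRIPT>//igs);

print "first page changed, would be rewritten\n"    if $n_with;
print "second page unchanged, would be skipped\n"   unless $n_without;
```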

      A word of caution: The code above compiles okay, but I
      didn't test it completely. Make sure that the $directory you
      specify is a copy or that the files you process are backed up.
      Also note that File::Find recurses through subdirectories. If
      you don't want subdirectories included, you'll have to use
      opendir. Good Luck.
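
For the non-recursive case, a rough sketch with opendir might look like this. The sub name html_files_in is made up for illustration, and the sketch only lists the files; the slurp-and-substitute step from the script above would go inside the loop.

```perl
use strict;
use warnings;

# Return the names of .html/.htm files directly inside $dir (no recursion).
# html_files_in is a hypothetical helper, not part of any module.
sub html_files_in {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    # keep plain files only, so subdirectories are never entered
    my @names = sort grep { /\.html?$/ && -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return @names;
}

# List the candidates in the current directory.
print "$_\n" foreach html_files_in('.');
```

The built-in glob, as in glob('*.html'), is another option and is the closest match to the bash loop in the question.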

      HTH,
      Charles K. Clarkson