Re: [PBML] converting html to txt

  • Bryan Irvine
    Message 1 of 8, Dec 20, 2002
      .gz is the extension for gzip-compressed files, sort of like .zip on
      Windows or .sit on a Mac. First you need to decompress it.

      From the command line, type

      "gunzip striphtml.gz" and you should be left with something usable.


      --Bryan

    • Mertens Bram
      Message 2 of 8, Dec 20, 2002
        Hi,

        I want to convert some html-files into txt-files. Unfortunately the
        html-files seem to be created by WYSIWYG-editors because the code looks
        horrible.

        I found a very useful script on
        http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz
        but the .gz extension confuses me (and more importantly my shell).

        And I would like to change one little thing about this script:
        Whenever the tag "<br>" isn't followed by a newline the text ends up
        pasted together.

        I have a little script that converts the "<br>" tag into a newline, and I
        run the files through this script first, but I would like to combine
        these scripts into one. Is this possible?

        My script to remove the "<br>" tags:
        =====================================
        #!/usr/bin/perl

        my $record = '';
        while (<>) {
        s/<br>|<BR>/\n/g;
        $record .= $_;
        }
        print "$record\n" if $record;
        =====================================

        All suggestions and comments are welcome!

        TIA
        --
        # Mertens Bram "M8ram" <m8ram.list@...> Linux User #249103 #
        # Red Hat Linux release 7.3 (Valhalla) kernel 2.4.18-3 i686 128MB RAM #
        # 11:50pm up 4 days, 13:55, 1 user, load average: 0.17, 0.25, 0.17 #
      • Mystik Gotan
        Message 3 of 8, Dec 24, 2002
          No, change the filename, like:

          $filename =~ s/\.html?$/.txt/;

          And delete all HTML tags, if you wish. Also, I'd change <br> to \n, for
          readability purposes.

          $content =~ s/<br>/\n/gi;
          $content =~ s/<[^>]+>//g;

          I don't think this exact regex will work, but something LIKE this will.
          You might need to contact the more advanced Regexers ;)

          --------------
          Bob Erinkveld (Webmaster Insane Hosts)
          www.insane-hosts.net
          MSN: gotan2k3@...
        • Mertens Bram
          Message 4 of 8, Dec 24, 2002
            Hi,

            Bryan, thanks for your response, but I'm afraid it was a user error...

            I thought it was a compressed file too, but when I tried to uncompress it
            I got the following:
            [M8ram@localhost bin]$ unzip striphtml.gz
            Archive: striphtml.gz
            End-of-central-directory signature not found. Either this file is not
            a zipfile, or it constitutes one disk of a multi-part archive. In the
            latter case the central directory and zipfile comment will be found on
            the last disk(s) of this archive.
            unzip: cannot find zipfile directory in one of striphtml.gz or
            striphtml.gz.zip, and cannot find striphtml.gz.ZIP, period.
            [M8ram@localhost bin]$ file striphtml.gz
            striphtml.gz: perl script text executable

            So I simply renamed the file (which already made it usable) and
            compressed that file for comparison:
            [M8ram@localhost bin]$ file striphtml-compressed.tar.gz
            striphtml-compressed.tar.gz: gzip compressed data, deflated, last modified: Tue Dec 24 13:23:33 2002, os: Unix
            [M8ram@localhost bin]$ file striphtml.gz
            striphtml.gz: perl script text executable
            -rwx------ 1 M8ram M8ram 11k Dec 20 23:56 striphtml
            -rw-rw-r-- 1 M8ram M8ram 3.7k Dec 24 13:23 striphtml-compressed.tar.gz
            -rw------- 1 M8ram M8ram 11k Dec 18 11:07 striphtml.gz

            Could it be that the file was compressed on the server but that my
            browser (Opera 6.11) uncompressed it before displaying it? I saved the
            file to my hard disk after reading it in my browser and kept the
            extension because I thought it was necessary...
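
The `file` output above shows that the name of a file says nothing about its contents: `file` looks at the first bytes instead. A minimal sketch of the same check in Perl follows, assuming only that gzip data starts with the two bytes 0x1f 0x8b. The helper name `looks_gzipped` is made up for this illustration, and the real `file` command knows far more signatures than this one.

```perl
use strict;
use warnings;

# Return true if the file starts with the two gzip magic bytes, 0x1f 0x8b.
# (looks_gzipped is a hypothetical helper name, not part of striphtml.)
sub looks_gzipped {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    binmode($fh);                          # compare raw bytes, not characters
    my $got = read($fh, my $magic, 2);     # number of bytes actually read
    close($fh);
    return defined $got && $got == 2 && $magic eq "\x1f\x8b";
}

# Report on each file named on the command line.
for my $path (@ARGV) {
    printf "%s: %s\n", $path,
        looks_gzipped($path) ? 'gzip data' : 'not gzip data';
}
```

Run against the files listed above, a check like this would be expected to flag only striphtml-compressed.tar.gz. One plausible explanation for the uncompressed download, not confirmed anywhere in this thread, is that the server sent the file with a gzip content encoding and the browser decoded it on the fly.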

            The renamed version works fine except for the part about replacing the
            '<br>' tags with newlines.

            Thanks, another mystery solved! :)

            p.s. I only replied this late because your reply (and several other
            messages) never made it to my mailbox...
            --
            # Mertens Bram "M8ram" <m8ram.list@...> Linux User #249103 #
            # Red Hat Linux release 7.3 (Valhalla) kernel 2.4.18-3 i686 128MB RAM #
            # 1:30pm up 8 days, 3:35, 1 user, load average: 0.13, 0.13, 0.10 #
          • Mertens Bram
            Message 5 of 8, Dec 24, 2002
              On Tue, 2002-12-24 at 12:40, Mystik Gotan wrote:
              > No, change the filename, like:
              >
              > $filename =~ s/\.html?$/.txt/;
              >
              > And delete all HTML tags, if you wish. Also, I'd change <br> to \n, for
              > readability purposes.
              >
              > $content =~ s/<br>/\n/gi;
              > $content =~ s/<[^>]+>//g;
              >
              > I don't think this exact regex will work, but something LIKE this will.
              > You might need to contact the more advanced Regexers ;)

              Sorry, I believe you misunderstood me.

              I now have three scripts:
              two perl-scripts that do the actual work:
              'replace_br.pl' and 'striphtml.pl'
              and one shell-script that combines these two:
              'html2txt'

              'replace_br.pl' looks like this (slightly modified since the first
              post):
              ===============================
              #!/usr/bin/perl

              # Collect the input, turning every <br> (in any case) into a newline.
              my $record = '';
              while (<>) {
                  s/<br>/\n/gi;
                  $record .= $_;
              }
              print "$record\n" if $record;
              ===============================

              'striphtml.pl' can be found at
              http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz

              and the essence of the shell-script looks like:
              ===============================================================================
              /home/M8ram/bin/replace_br.pl < ${HTML} | /home/M8ram/bin/striphtml.pl > ${TXT}
              ===============================================================================

              where the variables ${HTML} and ${TXT} are the original html-file and
              the new text-file.

              However I would like to edit 'striphtml.pl' so that it also performs the
              action of 'replace_br.pl' so I can get rid of the two other scripts.

              TIA
              --
              # Mertens Bram "M8ram" <m8ram.list@...> Linux User #249103 #
              # Red Hat Linux release 7.3 (Valhalla) kernel 2.4.18-3 i686 128MB RAM #
              # 2:31pm up 8 days, 4:36, 1 user, load average: 0.06, 0.14, 0.10 #
            • Mystik Gotan
              Message 6 of 8, Dec 24, 2002
                Well, maybe you can define a subroutine, replace_br, and put the code
                from replace_br.pl in it. Then simply call the subroutine :-)
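
A rough sketch of that subroutine idea, assuming the whole document is already in one string. `replace_br` follows the name suggested above. `strip_tags` is a hypothetical and deliberately naive stand-in for the much more careful tag removal that striphtml performs (it ignores comments, scripts and quoted `>` characters).

```perl
use strict;
use warnings;

# Turn every <br> variant into a newline: <br>, <BR>, <br/>, <br clear="all">.
sub replace_br {
    my ($html) = @_;
    $html =~ s{<br\b[^>]*>}{\n}gi;
    return $html;
}

# Naive stand-in for striphtml: drop anything that looks like a tag.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

my $html = "<p>first line<br>second line<BR />third</p>\n";
my $text = strip_tags(replace_br($html));
print $text;
```

The order matters: the line breaks have to be converted before the generic tag removal runs, otherwise the `<br>` tags are deleted along with everything else and the text ends up pasted together again.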

                --------------
                Bob Erinkveld (Webmaster Insane Hosts)
                www.insane-hosts.net
                MSN: gotan2k3@...
              • Mertens Bram
                Message 7 of 8, Dec 24, 2002
                  On Tue, 2002-12-24 at 15:48, Mystik Gotan wrote:
                  > Well, maybe you can define a subroutine: replace_br and put your
                  > replace_br.pl up there. And then simply call the subroutine :-)

                  I started to work on this and just tried to add my regex between the
                  section that removes the comments and the section that removes the tags.

                  Nothing else was needed!

                  The script works almost perfectly...
                  But I'd like to understand why it does...

                  The script starts with:
                  #!/usr/bin/perl -p0777

                  According to a "quick Reference guide" I found on the net the -p option
                  "assumes an input loop around the script. Lines are printed"
                  Can anybody explain this to me? Or point me in a direction where I can
                  find info on this?
                  And what does the 0777 mean or do?
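
For reference, both switches are described in the perlrun manual page. `-p` wraps the script in a `while (<>) { ... }` loop that prints `$_` after every pass, and `-0777` sets the input record separator `$/` to undef, so a single read returns a whole file ("slurp mode"). The sketch below imitates that on an in-memory string rather than on real files. The sample text and variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $html = "one<br>two\nthree<BR>\nfour\n";

# Default behaviour: $/ is "\n", so the loop body runs once per line.
open(my $by_line, '<', \$html) or die "cannot open in-memory file: $!";
my $line_passes = 0;
$line_passes++ while <$by_line>;
close($by_line);

# What -0777 changes: with $/ undefined, one read returns everything.
open(my $slurp, '<', \$html) or die "cannot open in-memory file: $!";
my $slurp_passes = 0;
my $out = '';
{
    local $/ = undef;
    while (<$slurp>) {        # -p supplies this loop (reading <> instead)
        $slurp_passes++;
        s/<br>/\n/gi;         # the script body sees the whole input in $_
        $out .= $_;           # -p would print $_ at this point
    }
}
close($slurp);

print "line mode: $line_passes passes; slurp mode: $slurp_passes pass\n";
print $out;
```

Slurp mode is what lets a substitution match a tag or comment that spans several lines, which is probably why a tag-stripping script would want it.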

                  The script has the following at the top:
                  require 5.002; # for nifty embedded regexp comments

                  Does this simply mean you need the perl-interpreter version 5.002?
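
As far as the Perl documentation goes, `require` followed by a number checks at run time that the interpreter is at least that version and dies otherwise, so any newer perl passes too. The "embedded regexp comments" presumably refer to the `/x` modifier, which allows whitespace and `#` comments inside a pattern. That reading is an assumption, not something stated in this thread. A small sketch:

```perl
use strict;
use warnings;

# $] holds the version of the running interpreter as a number.
printf "running perl %s\n", $];

require 5.002;    # fine on 5.002 or anything newer, fatal on older perls

# With /x, whitespace and # comments inside the pattern are ignored.
my $tag = qr{
    <         # opening angle bracket
    [^>]+     # everything up to ...
    >         # ... the closing angle bracket
}x;

(my $text = "a<b>bold</b>c") =~ s/$tag//g;
print "$text\n";
```

The pattern here is only an illustration of the comment syntax and is far simpler than what striphtml actually uses.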

                  Thanks
                  --
                  # Mertens Bram "M8ram" <m8ram.list@...> Linux User #249103 #
                  # Red Hat Linux release 7.3 (Valhalla) kernel 2.4.18-3 i686 128MB RAM #
                  # 5:38pm up 8 days, 7:42, 1 user, load average: 0.05, 0.05, 0.00 #
                • Hans Ginzel
                  Message 8 of 8, Jan 2, 2003
                    On Tue, Dec 24, 2002 at 04:48:08PM +0100, Mystik Gotan wrote:
                    > Well, maybe you can define a subroutine: replace_br and put your
                    > replace_br.pl up there. And then simply call the subroutine :-)

                    Or you can use an external program for doing the conversion (with
                    tables and frames):

                    To use w3m to translate HTML files:
                    $ cat foo.html | w3m -T text/html
                    or
                    $ cat foo.html | w3m -dump -T text/html >foo.txt

                    That's in the shell. For Perl, you can use an example I posted here two
                    weeks ago with the subject `Executing using system()', demonstrating
                    piping tar with gzip from Perl (man perlopentut;
                    perldoc -f open; man w3m).
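
The earlier post is not reproduced in this thread, so the following is only a sketch of the general approach described in perlopentut: open a pipe that reads the converter's output. It assumes w3m is installed and accepts a file name after the `-dump -T text/html` flags shown above. `dump_html` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Run an external converter on an HTML file and return its text output.
# The command is passed as a list, so no shell is involved and file names
# containing spaces or shell metacharacters are safe.
sub dump_html {
    my ($file, @cmd) = @_;
    # Default command: the w3m invocation from the post above (assumed installed).
    @cmd = ('w3m', '-dump', '-T', 'text/html') unless @cmd;
    open(my $pipe, '-|', @cmd, $file)
        or die "cannot run $cmd[0]: $!";
    local $/;                      # read the whole output in one go
    my $text = <$pipe>;
    close($pipe)
        or die "$cmd[0] failed (wait status $?)";
    return $text;
}

# Usage, assuming w3m is installed and foo.html exists:
#   print dump_html('foo.html');
```

Reading from a one-way pipe like this avoids the deadlock risk that comes with feeding a program's input and reading its output at the same time.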

                    If you would like to open a bidirectional pipe, the IPC::Open2 library
                    will handle this for you. Check out "Bidirectional Communication with
                    Another Process" in perlipc.

                    Happy New Year

                    Hans