
Re: [PBML] html2text

  • Richard Carver
    Message 1 of 4 , Feb 28, 2003
      > I wanted to convert HTML files to plain text keeping much of the formatting.
      > I have tried using the HTML::TokeParser subroutine get_trimmed_text, but was
      > not able to preserve the format and also some junk was able to get through.

      What do you mean by "preserve the format"? What kind of format are you trying
      to preserve? HTML defines the format. If you strip it out, you have no format.
    • Brian Gordon
      Message 2 of 4 , Mar 1, 2003
        He means the line spacing, spaces, and so on. For example:

        HTML DOCUMENT
        texttext image

        image texttext

        would come out to be

        TEXT DOCUMENT
        texttext

        texttext


        I doubt it's possible, barring the use of thousands of complex regexes.
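
A parser-based approach can get fairly close to the example above without a pile of regexes. The following is only a minimal sketch, written in Python with the standard `html.parser` module rather than in Perl, and not taken from any tool mentioned in this thread. The names `TextExtractor`, `html_to_text`, `BLOCK_TAGS`, and `SKIP_TAGS` are made up for illustration, and the list of tags treated as line breaks is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these tags should start a new line in the text output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: the contents of these elements are "junk" and get dropped.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text, turning block-level tags into line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        # Any other tag (for example img) contributes nothing.

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and tag != "br":
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    # Keep at most one blank line in a row and none at the start.
    kept = []
    for line in lines:
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    sample = "<p>texttext <img src='a.png'></p><p><img src='b.png'> texttext</p>"
    print(html_to_text(sample))
```

On the sample, the images disappear and the two paragraphs stay separated by a blank line. Tables, nested lists, and indentation would need more rules than this sketch has.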
        Unsubscribing info is here: http://help.yahoo.com/help/us/groups/groups-32.html

        Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.



      • Richard Carver
        Message 3 of 4 , Mar 1, 2003
          I haven't used these but you might check them out:

          - An HTML parser in Perl which will also convert HTML to plain text:
          http://sunsite.computing.dcu.ie/pub/web-tools/html-parser/html-parser.tar.Z

          - HTML to text converter Version 1.01
          http://www.ftls.org/en/examples/perl-tools/html2txt.shtml

          Regards,
          Rich


        • Richard Carver
          Message 4 of 4 , Mar 1, 2003
            And check out Tom Christiansen's striphtml on CPAN:

            # striphtml ("striff tummel")
            # tchrist@...
            # version 1.0: Thu 01 Feb 1996 1:53:31pm MST
            # version 1.1: Sat Feb 3 06:23:50 MST 1996
            # (fix up comments in annoying places)
            #########################################################
            #
            # how to strip out html comments and tags and transform
            # entities in just three -- count 'em three -- substitutions;
            # sed and awk eat your heart out. :-)

            http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz


            Regards,
            Rich
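
The header quoted above describes three substitutions: strip comments, strip tags, transform entities. As a rough illustration of that idea only, here is a Python sketch. It is not Tom Christiansen's script and makes no claim to match its regexes; the function name `strip_html` is made up, and entity decoding is done with the standard `html.unescape` call instead of a substitution.

```python
import html
import re


def strip_html(markup):
    # 1. Remove comments, including ones that span several lines.
    no_comments = re.sub(r"<!--.*?-->", "", markup, flags=re.DOTALL)
    # 2. Remove tags. Quoted attribute values are matched as a unit so that
    #    a ">" inside quotes does not end the tag early.
    no_tags = re.sub(r"""<(?:[^>'"]|"[^"]*"|'[^']*')*>""", "", no_comments)
    # 3. Turn entities such as &amp; back into plain characters.
    return html.unescape(no_tags)


if __name__ == "__main__":
    print(strip_html('<p class="a>b">x &amp; y<!-- note --></p>'))
```

This removes markup but does nothing to keep line spacing, and it does not drop the contents of script or style elements, so it fits the "strip it out" case better than the "preserve the format" case.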

