Re: [NH] Clipbook / program to convert Word junk -> HTML

  • Marcelo Bastos
    Message 1 of 9 , Jan 3, 2007
      Julie wrote:
      > Hi,
      >
      > Because I'm seeing this as a two step process
      > involving Notetab for code clean up I am posting
      > this here rather than to the off-topic list.
      >
      > I'm looking for a program / clip library (?) to convert characters such as this:
      >
      > “Good and ill have not changed since yesteryear; nor
      > are they one thing among Elves and Dwarves and
      > another thing among Men. It is man’s part to
      > discern them as much in the Golden
      > Wood as in his own house.�
      > Aragorn to Éomer
      >
      > to HTML. Then the code still needs to be cleaned
      > up, because it was a "Word" nightmare. Character
      > conversion in Notetab doesn't do this for me.
      >
      > Anyone have an idea of what can do this and / or
      > a list of which code converts to what? Some are
      > no brainers, but some of the text is accented,
      > and no clue what vowel or diacritic would go where.
      >
      >

      Hmmm, this looks like an encoding issue. From what I can see, the text
      seems to be encoded as UTF-8. I offer you a couple solutions:

      1. Just leave the text itself as is, and add the line:
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
      (might be wrong, might be UTF8 without the hyphen, I'm not sure right now)
      to the <head> section of your HTML file. This should make the text
      readable on your web browser.

      2. Use HTMLTidy to convert it from UTF-8 to another encoding (say,
      iso-8859-1 or win-1252 -- although that last one might be troublesome in
      some contexts). Tidy does a very good job of it.
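      A minimal Python sketch of the diagnosis above, assuming the text really
      is UTF-8 that a viewer decoded as Windows-1252 (win-1252). The function
      names are illustrative only and are not part of NoteTab or Tidy.

```python
# Sketch only: reproduce the garbling by decoding UTF-8 bytes as Windows-1252,
# then reverse it. Function names are hypothetical, not from any tool in the thread.

def misread_utf8_as_cp1252(text: str) -> str:
    """Return what UTF-8 text looks like when a viewer assumes Windows-1252."""
    # Bytes that Windows-1252 leaves undefined (such as 0x9D) become U+FFFD.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair_mojibake(garbled: str) -> str:
    """Undo the misreading; raises UnicodeEncodeError if a byte was lost to U+FFFD."""
    return garbled.encode("cp1252").decode("utf-8")
```

      Under that assumption, a left curly quote turns into "â€œ" and "Éomer"
      into "Ã‰omer", which matches the sample in the question. The right curly
      quote contains the byte 0x9D, which Windows-1252 leaves undefined, so it
      cannot be repaired once it has been replaced by "�"; it is safer to go
      back to the original bytes in that case.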

      Unfortunately, I'm writing this on the road right now (extended New Year
      holiday) and I don't have all my references at hand to give you the
      nitty gritty details.

      Marcelo Bastos
    • Rudolf Horbas
      Message 2 of 9 , Jan 3, 2007
        > 2. Use HTMLTidy to convert it from UTF-8 to another encoding

        That's what I just wanted to post.

        > Unfortunately, I'm writing this on the road right now (extended New Year
        > holiday) and I don't have all my references at hand to give you the
        > nitty gritty details.

        Here I can help; my New Year holiday has been over since yesterday:
        http://tidy.sourceforge.net/docs/quickref.html#word-2000

        Instead of making a new config file for NoteTab, I'd suggest using
        TidyGUI (no longer maintained, but functional):
        http://perso.orange.fr/ablavier/TidyGUI/index.html

        The tab "cleanup" has the option "Source document is from MS Word 2000".

        (You could alternatively just use TidyGUI to save a Tidy.cfg for NoteTab.)

        HTH, and Happy New Year to all!
        Rudi
      • Julie
        Message 3 of 9 , Jan 3, 2007
          At 1/3/2007 10:29 AM, Rudolf Horbas wrote:

          >Here I can help; my New Year holiday has been over since yesterday:
          >http://tidy.sourceforge.net/docs/quickref.html#word-2000
          >
          >Instead of making a new config file for NoteTab, I'd suggest using
          >TidyGUI (no longer maintained, but functional):
          >http://perso.orange.fr/ablavier/TidyGUI/index.html
          >
          >The tab "cleanup" has the option "Source document is from MS Word 2000".

          Are characters like this font dependent? I've tried all the encoding
          options, as well as checking the "source document is from MS Word
          2000" tab, but none of the tries has correctly converted the characters.

          From the http://textism.com/wordcleaner/ site, my text from my other
          post translates as:

          “Good and ill have not changed since yesteryear; nor are they
          one thing among Elves and Dwarves and another thing among Men. It is
          man’s part to discern them as much in the Golden Wood as in his
          own house.” Aragorn to Éomer

          Which looks fine on preview in the browser. Any helpful hints?
        • loro
          Message 4 of 9 , Jan 3, 2007
            Julie wrote:
            >From the http://textism.com/wordcleaner/ site, my text from my other
            >post translates as:
            >
            >“Good and ill have not changed since yesteryear; nor are they
            >one thing among Elves and Dwarves and another thing among Men. It is
            >man’s part to discern them as much in the Golden Wood as in his
            >own house.” Aragorn to Éomer
            >
            >Which looks fine on preview in the browser. Any helpful hints?

            It's Word's curly quotes that give you trouble. They are non-standard. You
            can turn them off in Word. I don't know what happens if you try to turn
            them off on a document that already has them. Maybe if you turn them off
            and then paste the text into a new document?

            I think it's this:
            Tools | AutoCorrect, then you have them on both the AutoFormat and
            AutoFormat As You Type tabs, "Replace straight quotes with smart quotes".

            Too bad about WordCleaner. It used to be free. :-(

            Lotta
          • Julie
            Message 5 of 9 , Jan 3, 2007
              Hey Lotta

              >It's Word's curly quotes that give you trouble. They are non-standard. You
              >can turn them off in Word. I don't know what happens if you try to turn
              >them off on a document that already has them. Maybe if you turn them off
              >and then paste the text into a new document?

              It's also accented letters like Éomer. Many of
              these are articles that have been posted in blogs
              that I've collected... I can't believe people
              posted that mess! A friend wants to repost them
              cleaned up, so I thought I'd see if there was an easy way to do this. :-)

              >Too bad about WordCleaner. It used to be free. :-(

              The site gives me six uses a day. The potential
              project isn't a rush at least... doesn't matter
              how long it takes, but I have a substantial
              number of articles to convert. Could take a while. LOL

              Julie


              --
              No virus found in this outgoing message.
              Checked by AVG Free Edition.
              Version: 7.5.432 / Virus Database: 268.16.4/615 - Release Date: 1/3/2007 1:34 PM
            • loro
              Message 6 of 9 , Jan 3, 2007
                I wrote:
                >It's Word's curly quotes that give you trouble.

                You can do it with Notetab too. Notetab can display the curly quotes and
                the Replace thingie recognizes them, so you can select one of each kind and
                do a "replace all" with the entity for the corresponding legit curly quote.
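                The same "replace all" idea can be sketched in Python, which
                also gives a list of which character converts to what. This is
                only an illustration: `to_entities` is a made-up helper name,
                and it assumes the text has already been read in with the
                correct encoding.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity.

    Named entities (such as &ldquo; or &Eacute;) are used where the standard
    library knows one; anything else falls back to a numeric reference.
    ASCII characters, including existing markup, are left untouched.
    """
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)
        elif code in codepoint2name:
            pieces.append(f"&{codepoint2name[code]};")
        else:
            pieces.append(f"&#{code};")
    return "".join(pieces)
```

                With this sketch the curly double quotes come out as &ldquo;
                and &rdquo;, the curly apostrophe as &rsquo;, and É as
                &Eacute;, so accented letters are handled in the same pass as
                the quotes.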

                Lotta
              • loro
                Message 7 of 9 , Jan 3, 2007
                  Julie wrote:
                  > >It's Word's curly quotes that give you trouble.

                  >It's also accented letters like Éomer.

                  Ah. The first example came through all jumbled so I went by the second one.

                  >The site gives me six uses a day. The potential
                  >project isn't a rush at least... doesn't matter
                  >how long it takes, but I have a substantial
                  >number of articles to convert. Could take a while. LOL

                  You could use a proxy. ;-o)

                  Lotta
                • Julie
                  Message 8 of 9 , Jan 3, 2007
                    Hey Lotta -

                    >You could use a proxy. ;-o)

                    The thought has crossed my mind. <G>

                    Julie


                  • bruce.somers@web.de
                    Message 9 of 9 , Jan 4, 2007
                      Julie <gleits@...> wrote:

                      I can't believe people
                      posted that mess! A friend wants to repost them
                      cleaned up, so I thought I'd see if there was an easy way to do this. :-)

                      No, you needn't believe that. It's much more likely that some component (program) used by the poster of the blog entry has replaced what it considered to be non-standard characters (curly quotes, accented characters, etc.) with their corresponding "escape-codes", because many viewers will not have the character sets needed to display them. Many systems recognize only the extremely provincial and badly limited ASCII character set.

                      It's probably the blog software that is not able to replace those escape-codes with the corresponding characters.
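                      If the articles do contain such escape-codes (HTML
                      character references, which is an assumption about what
                      the blog software produced), turning them back into
                      characters is a one-liner with Python's standard library.
                      The wrapper name below is hypothetical.

```python
import html


def decode_escape_codes(text: str) -> str:
    """Turn HTML character references back into the characters they stand for."""
    # html.unescape handles named (&Eacute;) and numeric (&#8220;) references.
    # Following HTML5 rules, it also maps numeric references in the 128-159
    # range, as written by some Windows tools, to the Windows-1252 character.
    return html.unescape(text)
```

                      For example, both "&#8220;Good&#8221;" and the
                      Windows-style "&#147;Good&#148;" decode to the same
                      curly-quoted text.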

                      Bruce