Transform UTF-8 to ANSI

  • Axel Berger
    Message 1 of 7 , Nov 21, 2011
      Unless I've overlooked something, NoteTab can't do this natively, so has
      anyone written a clip to transform a UTF-8 text file to ANSI? Ideally,
      codes undefined in ANSI should be recoded as HTML entities, such as the
      one for "—". I'd be glad not to have to start from scratch.
      In case anyone's interested, I've already got a clip to test a file for
      legal UTF-8 encoding. It starts at the cursor and stops at any illegal
      sequence:


      :loop
      ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
      ^!IfError usasc
      ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
      ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
      ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
      ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
      ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
      ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
      ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
      ^!Continue no match
      ^!Goto loop
      :usasc
      ^!Continue No errors found
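
      For readers who want to try the same check outside NoteTab, here is a
      minimal Python sketch. It reuses the byte ranges from the clip above;
      the names LEGAL, CANDIDATE and first_illegal_offset are made up for
      this illustration and are not part of any library.

```python
import re
from typing import Optional

# Well-formed multi-byte sequences, using the same byte ranges as the clip.
LEGAL = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
)

# Anything that is not plain ASCII: a stray continuation byte, or a lead
# byte together with all continuation bytes that follow it.
CANDIDATE = re.compile(rb"[\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*")


def first_illegal_offset(data: bytes, start: int = 0) -> Optional[int]:
    """Return the offset of the first ill-formed sequence, or None if all is well."""
    for match in CANDIDATE.finditer(data, start):
        # The whole candidate must be exactly one legal sequence.
        if not LEGAL.fullmatch(match.group()):
            return match.start()
    return None
```

      Like the clip, this treats a legal sequence followed directly by a
      stray continuation byte as one illegal run and reports where the run
      begins.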

      Thanks
      Axel








      --
      Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
      Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
      D-51519 Odenthal-Heide eMail: Axel-Berger@...
      Deutschland (Germany) http://berger-odenthal.de
    • John Shotsky
      Message 2 of 7 , Nov 21, 2011
        I have something similar, but am not sure it would meet your needs. I code each character separately and run through a
        document changing those that need it. Mine converts all smart quotes and smart double quotes to 'common' quotes and
        double quotes as well. So it's more of a substitution of what I want for what it finds. It knows all the html codes for
        characters and converts those as well. For instance, it also converts all single-character fractions to standard three+
        character fractions. I can email it to you if interested. With some editing, it may be of some use to you.

        I get files that have originated in Mac and Unix as well as Windows, so conversions are needed.

        However, be aware that NoteTab will display a question mark for anything it doesn't understand, and if you save the file
        you will save the question mark. The Mac character set includes single-character fractions for eighths, which NoteTab
        cannot understand, for example. For files like that, I have another clip library that must be run on the document before
        opening and saving it in NoteTab. It converts to ANSI.
        An example line is:
        ^!Replace "(*UTF8)\x{215B}" >> "1/8" RAWS0
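
        As an illustration of this character-by-character substitution approach outside NoteTab, a small Python sketch might
        look like the following. The table contents and the names SUBSTITUTIONS and plain_substitutes are assumptions made
        up for this example; they are not taken from John's library.

```python
# Hypothetical substitution table: each key is one character, each value
# is the plain replacement wanted for it. Extend it as sources require.
SUBSTITUTIONS = str.maketrans({
    "\u2018": "'",    # left single smart quote
    "\u2019": "'",    # right single smart quote
    "\u201C": '"',    # left double smart quote
    "\u201D": '"',    # right double smart quote
    "\u00BC": "1/4",  # single-character one quarter
    "\u00BD": "1/2",  # single-character one half
    "\u00BE": "3/4",  # single-character three quarters
    "\u215B": "1/8",  # single-character one eighth
})


def plain_substitutes(text: str) -> str:
    """Replace every character listed in the table; leave the rest alone."""
    return text.translate(SUBSTITUTIONS)
```

        Because each entry is chosen by hand, such a table can also map several misused look-alike characters to one
        preferred character, which a purely arithmetic conversion cannot do.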

        Regards,
        John
        RecipeTools Web Site: http://recipetools.gotdns.com/

      • Axel Berger
        Message 3 of 7 , Nov 21, 2011
          John Shotsky wrote:
          > With some editing, it may be of some use to you.

          I'm sure it will be. Simple texts I get from others, often as mail,
          aren't going to be all that sophisticated. It's simply that beginning
          from scratch is a lot of typing at first.

          For single-character conversions, byte to byte, I have something else. I
          save 256 bytes as a file(1) and run a Pascal program that loads that into
          a byte array and then goes through the text byte by byte, replacing n by
          byte(n). It's no racer but fast enough for all practical purposes.

          (1) That format was used for the same purpose by the Atari editor I used
          previously, Tempus.
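
          The 256-byte table idea can be sketched in Python as well, where
          bytes.translate does the byte-by-byte lookup. This is only an
          illustration, not the Pascal program; the name apply_byte_table is
          hypothetical.

```python
def apply_byte_table(text: bytes, table: bytes) -> bytes:
    """Replace every byte n of text by table[n]."""
    if len(table) != 256:
        raise ValueError("the translation table must hold exactly 256 bytes")
    return text.translate(table)


# Example table: identity everywhere, except that byte 0x84 (the letter
# a-umlaut in code page 437) becomes 0xE4 (a-umlaut in Windows-1252).
example_table = bytearray(range(256))
example_table[0x84] = 0xE4
```

          A table saved as a file could be loaded with
          pathlib.Path(...).read_bytes() and passed in unchanged.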

          Axel
        • Ian NTnerd
          Message 4 of 7 , Nov 21, 2011
            You don't sound like you need it, but if you want industrial-strength
            encoding conversion, look at this non-NT package:
            http://www.sil.org/computing/catalog/show_software.asp?id=120
            The .map files that come with it would be of some use, though some of
            them use the Unicode names rather than the codes. So it is a good
            resource for creating scripts.

            It might be of interest to some NTers.

            Ian

          • Axel Berger
            Message 5 of 7 , Nov 23, 2011
              John Shotsky wrote:
              > I have something similar,

              John sent me a small converter, by Sheri, for single-character fractions
              like ¼ to their three-character equivalents like 1/4, and a quite
              comprehensive one of his own for his recipes from diverse sources. Both
              are much tidier than my quick-and-dirty efforts, with comprehensive
              comments, error checking, and clearing of variables. But I've done
              something different now and solved it algorithmically.

              By the way, it's all a bit superfluous, as NoteTab does do it natively
              after all. I just have to save the text to a file and open that in
              NoteTab. UTF-8 text is then shown as ANSI characters and I can save
              as ANSI. It just does not work with my usual method of pasting text
              into an open, empty, new document.

              First off, John, your conversions of "quoted-printable" characters can
              be more generalised thus:
              ^!Replace "=^P" >> "" WASTI
              ^!Replace "=3D" >> "<gleich>" WASTI
              ^!Jump TEXT_START
              :loop
              ^!Find "=[0-9A-Fa-f]{2}" WRASTI
              ;^!Continue
              ^!IfError fini
              ^!InsertText ^$DecToChar(^$HexToInt(^$StrCopyRight("^$GetSelection$";2)$)$)$
              ^!Goto loop
              :fini
              ^!Replace "<gleich>" >> "=" WASTI
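
              For comparison, the same quoted-printable steps can be sketched in
              Python. This is an illustration under the assumption that each
              decoded byte is taken as the Latin-1 character with that code; the
              function name decode_quoted_printable is made up here.

```python
import re


def decode_quoted_printable(text: str) -> str:
    """Undo quoted-printable: join soft line breaks, then decode =XX pairs."""
    # A "=" at the very end of a line is a soft break: drop it and the newline.
    text = re.sub(r"=\r?\n", "", text)
    # Replace each =XX by the character with that hexadecimal code.
    # re.sub never rescans its own output, so a decoded "=" followed by
    # hex digits is not decoded a second time.
    return re.sub(
        r"=([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )
```

              The single pass is why no placeholder is needed here: the clip
              parks "=3D" as "<gleich>" because its loop searches the edited text
              again, whereas re.sub only looks at the original input.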

              My solution for UTF-8 is the following. It is not fully tested.
              Unicode's first 256 code points are Latin-1, so characters where
              cp-1252 (aka Windows) and Latin-1 differ have to be treated
              specially; the clip handles eight of them. Quite a few of the ^!Set
              lines are long:

              :loop
              ^!Find "[\xC0-\xF7][\x80-\xBF]*" RS
              ^!IfError donelatin
              ^!IfMatch "[\xC2-\xC3][\x80-\xBF]" "^$GetSelection$" latin1
              ^!IfMatch "[\xC0-\xDF][\x80-\xBF]" "^$GetSelection$" zwei
              ^!IfMatch "[\xE0-\xEF][\x80-\xBF]{2}" "^$GetSelection$" drei
              ^!IfMatch "[\xF0-\xF7][\x80-\xBF]{3}" "^$GetSelection$" vier
              ^!Continue Illegal sequence, can't be converted.
              ^!Goto loop
              :zwei
              ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
              ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 32)$
              ^!Set %third%=0
              ^!Set %fourth%=0
              ^!Goto makeent
              :drei
              ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
              ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
              ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 16)$
              ^!Set %fourth%=0
              ^!Goto makeent
              :vier
              ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";4)$)$ MOD 64)$
              ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
              ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
              ^!Set %fourth%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 8)$
              :makeent
              ^!Set %first%=^$Calc(262144*^%fourth%+4096*^%third%+64*^%second%+^%first%;0)$
              ^!InsertText &#^%first%;
              ^!Goto loop
              :latin1
              ^!Set %first%=^$StrCopyRight("^$GetSelection$";1)$
              ^!Set %second%=^$StrCopyLeft("^$GetSelection$";1)$
              ^!Set %first%=^$Calc(^$CharToDec(^%first%)$ MOD 64)$
              ^!Set %second%=^$Calc(^$CharToDec(^%second%)$ MOD 4)$
              ^!InsertText ^$DecToChar(^$Calc(64*^%second%+^%first%)$)$
              ^!Goto loop
              :donelatin
              ^!Replace "&#8364;" >> "€" WASTI
              ^!Replace "&#352;" >> "Š" WASTI
              ^!Replace "&#353;" >> "š" WASTI
              ^!Replace "&#381;" >> "Ž" WASTI
              ^!Replace "&#382;" >> "ž" WASTI
              ^!Replace "&#338;" >> "Œ" WASTI
              ^!Replace "&#339;" >> "œ" WASTI
              ^!Replace "&#376;" >> "Ÿ" WASTI
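
              To make the arithmetic easier to follow, here is a hedged Python
              sketch of the same idea. codepoint_from_sequence mirrors the MOD and
              multiply steps of the clip, and utf8_to_ansi shows one way to get
              the overall result with Python's own codecs. Both function names are
              invented for this example, and the choice of Windows-1252 as "ANSI"
              is an assumption.

```python
def codepoint_from_sequence(seq: bytes) -> int:
    """Combine the payload bits of one 2-, 3- or 4-byte UTF-8 sequence."""
    # The lead byte keeps 5, 4 or 3 payload bits, hence MOD 32, 16 or 8.
    lead_modulus = {2: 32, 3: 16, 4: 8}[len(seq)]
    value = seq[0] % lead_modulus
    # Every continuation byte contributes 6 bits: shift by 64, add MOD 64.
    for byte in seq[1:]:
        value = value * 64 + byte % 64
    return value


def utf8_to_ansi(data: bytes) -> bytes:
    """Convert UTF-8 bytes to Windows-1252, using &#N; where no byte exists."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on illegal input
    return text.encode("cp1252", errors="xmlcharrefreplace")
```

              Letting the codec decide covers every character that Windows-1252
              defines, not only the eight restored by the ^!Replace lines above.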


              Axel
            • John Shotsky
              Message 6 of 7 , Nov 23, 2011
                Thanks, Axel, I may incorporate this into my library. But you know what they say about sleeping dogs… :-)

                The problems I encounter are often related to someone's incorrect usage of certain characters. For example there are
                several different characters that people use to denote the one and only 'degrees' symbol. So my clips carefully look for
                these incorrect usages and convert them to 'real' degrees symbols. An algorithmic approach would miss those, and then
                I'd still have to have something to make those conversions. There are other such cases, as well, like the already
                mentioned 'smart quotes' needing to be converted to standard quotes.

                Regards,
                John
                RecipeTools Web Site: http://recipetools.gotdns.com/

              • Axel Berger
                Message 7 of 7 , Nov 23, 2011
                  John Shotsky wrote:
                  > The problems I encounter are often related to someone's
                  > incorrect usage of certain characters.

                  Yes quite, I have something similar somewhere in my library. My point
                  was limited to the "quoted printable" encoding in mails. But it does
                  entail a loop, which would probably mess up your comprehensive clip. I
                  tend not to have an all-in-one but rather run several clips in sequence.
                  As not everything can be automated, several of these end in a ^!Find
                  command, allowing me to <F3> through the text and look at all doubtful
                  places.

                  Axel