
Re: [Clip] Trouble with UTF

  • Axel Berger
    Message 1 of 12 , Jan 11, 2012
      loro wrote:
      > You can turn it off in Options, the General tab
      > Protect Unicode Files:

      Thanks Loro, I had of course looked there but overlooked that (to me)
      cryptic option. But what about

      > As a result, the conversion process may drop non-ANSI and
      > cause the loss of information.

      I want no conversion and opening files as-is, as the "UTF-8 (no
      conversion)" setting in the open dialog would do, and do all conversions
      myself. At the very least I do not want to simply lose something.

      Danke
      Axel

      P.S.: I still want a warning message. Sometimes I have set files on disk
      to read-only myself and don't think about it, and I hate it when
      editing just refuses to work. Other programs allow full editing and only
      remap "save" to "save as" in these cases. A much better solution IMHO.

      --
      Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
      Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
      D-51519 Odenthal-Heide eMail: Axel-Berger@...
      Deutschland (Germany) http://berger-odenthal.de
    • Art Kocsis
      Message 2 of 12 , Jan 11, 2012
        The first thing I thought of is that there is a big difference between
        turning the protect option off and loading a document as Unicode and then
        turning read-only off. Even after editing a Unicode doc and saving it,
        the Unicode characters are still intact, but they would be lost if the doc
        was loaded as ANSI.

        BTW, Axel, will your clip be available after you get finished tweaking it?

        Art


        At 1/11/2012 11:20 AM, Axel wrote:
        >loro wrote:
        > > You can turn it off in Options, the General tab
        > > Protect Unicode Files:
        >
        >Thanks Loro, I had of course looked there but overlooked that (to me)
        >cryptic option. But what about
        >
        > > As a result, the conversion process may drop non-ANSI and
        > > cause the loss of information.

        <snip>

        >Axel
        >
        >P.S.: I still want a warning message. Sometimes I have set files on disk
        >to read-only myself and don't think about it, and I hate it when
        >editing just refuses to work. Other programs allow full editing and only
        >remap "save" to "save as" in these cases. A much better solution IMHO.
      • Axel Berger
        Message 3 of 12 , Jan 11, 2012
          Art Kocsis wrote:
          > will your clip be available after you get finished tweaking it?

          There was a very silly typing mistake in the first draft that didn't
          come to light until I first came upon non-ANSI characters. This seems to
          work (many ^!Set lines are long):

          :loop
          ^!Find "[\xC0-\xF7][\x80-\xBF]*" RS
          ^!IfError donelatin
          ^!IfMatch "[\xC2-\xC3][\x80-\xBF]" "^$GetSelection$" latin1
          ^!IfMatch "[\xC0-\xDF][\x80-\xBF]" "^$GetSelection$" zwei
          ^!IfMatch "[\xE0-\xEF][\x80-\xBF]{2}" "^$GetSelection$" drei
          ^!IfMatch "[\xF0-\xF7][\x80-\xBF]{3}" "^$GetSelection$" vier
          ^!Continue Illegal sequence, can't be converted.
          ^!Goto loop
          :zwei
          ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
          ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 32)$
          ^!Set %third%=0
          ^!Set %fourth%=0
          ^!Goto makeent
          :drei
          ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
          ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
          ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 16)$
          ^!Set %fourth%=0
          ^!Goto makeent
          :vier
          ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";4)$)$ MOD 64)$
          ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
          ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
          ^!Set %fourth%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 8)$
          :makeent
          ^!Set %first%=^$Calc(262144*^%fourth%+4096*^%third%+64*^%second%+^%first%;0)$
          ^!InsertText &#^%first%;
          ^!Goto loop
          :latin1
          ^!Set %first%=^$StrCopyRight("^$GetSelection$";1)$
          ^!Set %second%=^$StrCopyLeft("^$GetSelection$";1)$
          ^!Set %first%=^$Calc(^$CharToDec(^%first%)$ MOD 64)$
          ^!Set %second%=^$Calc(^$CharToDec(^%second%)$ MOD 4)$
          ^!InsertText ^$DecToChar(^$Calc(64*^%second%+^%first%)$)$
          ^!Goto loop
          :donelatin
          ^!Replace "&#8364;" >> "€" WASTI
          ^!Replace "&#352;" >> "Š" WASTI
          ^!Replace "&#353;" >> "š" WASTI
          ^!Replace "&#381;" >> "Ž" WASTI
          ^!Replace "&#382;" >> "ž" WASTI
          ^!Replace "&#338;" >> "Œ" WASTI
          ^!Replace "&#339;" >> "œ" WASTI
          ^!Replace "&#376;" >> "Ÿ" WASTI
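
For readers without NoteTab, the arithmetic in the clip above can be sketched in Python. This is a minimal sketch, not Axel's code: the names `decode_sequence`, `utf8_to_ansi` and `CP1252_EXTRAS` are invented for illustration, and it assumes the input is the raw UTF-8 bytes of the file. Like the clip, it strips the marker bits from each byte with MOD, combines the 6-bit groups into a code point, and writes either a single ANSI byte or a decimal entity.

```python
import re
from typing import Optional

# A lead byte followed by any continuation bytes, as in the clip's ^!Find.
SEQUENCE = re.compile(rb"[\xC0-\xF7][\x80-\xBF]*")

# Code points above 255 that Windows-1252 can still represent
# (the eight characters in the clip's final ^!Replace lines).
CP1252_EXTRAS = {0x20AC, 0x0160, 0x0161, 0x017D, 0x017E, 0x0152, 0x0153, 0x0178}


def decode_sequence(seq: bytes) -> Optional[int]:
    """Return the code point of one UTF-8 sequence, or None if its length is wrong."""
    lead = seq[0]
    if lead <= 0xDF:
        length, lead_mod = 2, 32   # label "zwei" in the clip
    elif lead <= 0xEF:
        length, lead_mod = 3, 16   # label "drei"
    else:
        length, lead_mod = 4, 8    # label "vier"
    if len(seq) != length:
        return None
    code = lead % lead_mod
    for byte in seq[1:]:
        code = code * 64 + byte % 64   # each continuation byte carries 6 bits
    return code


def utf8_to_ansi(data: bytes) -> bytes:
    """Replace multi-byte UTF-8 sequences by ANSI bytes or &#N; entities."""
    def replace(match):
        code = decode_sequence(match.group())
        if code is None:
            return match.group()               # leave malformed sequences untouched
        if 0x80 <= code <= 0xFF:
            return bytes([code])               # Latin-1 range: one byte
        if code in CP1252_EXTRAS:
            return chr(code).encode("cp1252")  # e.g. the euro sign becomes 0x80
        return b"&#%d;" % code                 # everything else: decimal entity
    return SEQUENCE.sub(replace, data)
```

Unlike the clip, the sketch does not stop on an illegal sequence; it simply leaves the bytes as they are.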

          It is advisable to check for legal UTF-8 first, i.e. that there are no
          8-bit characters that are not valid UTF-8:

          :loop
          ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
          ^!IfError usasc
          ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
          ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
          ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
          ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
          ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
          ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
          ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
          ^!Continue Illegal sequence, no UTF-8
          ^!Goto loop
          :usasc
          ^!Continue No errors found
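
The same check can be sketched in Python, reusing the byte patterns from the clip above. This is an illustration under the assumption that the file is read as raw bytes; `WELL_FORMED` and `first_invalid_offset` are made-up names, not part of any library.

```python
import re
from typing import Optional

# One well-formed UTF-8 character: ASCII plus the seven multi-byte patterns
# that the clip tests with ^!IfMatch.
WELL_FORMED = re.compile(b"|".join([
    rb"[\x00-\x7F]",
    rb"[\xC2-\xDF][\x80-\xBF]",
    rb"\xE0[\xA0-\xBF][\x80-\xBF]",
    rb"[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}",
    rb"\xED[\x80-\x9F][\x80-\xBF]",
    rb"\xF0[\x90-\xBF][\x80-\xBF]{2}",
    rb"[\xF1-\xF3][\x80-\xBF]{3}",
    rb"\xF4[\x80-\x8F][\x80-\xBF]{2}",
]))


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that is not legal UTF-8, or None."""
    pos = 0
    while pos < len(data):
        match = WELL_FORMED.match(data, pos)
        if match is None:
            return pos
        pos = match.end()
    return None
```

Python's built-in strict decoder applies the same rules, so `data.decode("utf-8")` raising `UnicodeDecodeError` is a shorter way to get the same verdict.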

          Neither clip starts with a ^!Jump TEXT_START; both begin at the
          current cursor position. This is on purpose, but you might want to
          change it.

          Axel

          --
          Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
          Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
          D-51519 Odenthal-Heide eMail: Axel-Berger@...
          Deutschland (Germany) http://berger-odenthal.de
        • Art Kocsis
          Message 4 of 12 , Jan 12, 2012
            Axel,

            Thanks for the clip. It will be interesting to analyze it in detail, which I
            will have time for later.
            After a quick scan it looks like you still have a Unicode document, i.e.,
            two bytes per character. With my Unicode source docs, such as Windows
            registry exports or web text, I just want to map into the ANSI character
            set and delete the upper byte. Setting NTB's option to not protect Unicode
            (as loro suggested) will probably work for me.

            Just in case you are not aware of it, Andrew West's Babelstone site
            [http://www.babelstone.co.uk/index.html] has an online and a freeware
            mapper for all the 110,116 (and counting!) Unicode chars, a freeware
            Unicode editor, a three hour(!) slide show of all the glyphs as well as
            LOTS of other interesting goodies. One could spend hours reading just one
            of his blogs, such as "Mani Stones in Many Scripts"
            [http://babelstone.blogspot.com/2006/11/mani-stones-in-many-scripts.html].
            The detail, depth and extent of the info presented is captivating and
            amazing (especially to anyone who has ever chanted the Sanskrit mantra: om
            mani padme hum).

            I also found it interesting to see your bilingual roots showing in your
            code, as evidenced by the mixture of German and English labels. It works
            well. When I was in Germany a few years ago some of my (very poorly
            remembered) linguistic training came out as well. I found myself counting
            eins, zwei, drei, cuatro, cinco, seis! My HS & college teachers would be
            proud<g>. (Or turning over in their graves!)

            Art


            At 1/11/2012 11:08 PM, Axel wrote:
            >Art Kocsis wrote:
            > > will your clip be available after you get finished tweaking it?
            >
            >There was a very silly typing mistake in the first draft that didn't
            >come to light until I first came upon non-ANSI characters. This seems to
            >work (many ^!Set lines are long):
            <snip>
            >Axel
          • Axel Berger
            Message 5 of 12 , Jan 13, 2012
              Art Kocsis wrote:
              > After a quick scan it looks like you still have a Unicode
              > document, i.e., two bytes per character.

              No. I map everything to ANSI that can be converted (label latin1),
              which is less than 128 characters. All the rest can't be mapped, so I
              convert all that to HTML entities like &#8211;

              Axel
            • Art Kocsis
              Message 6 of 12 , Jan 13, 2012
                OK, after a longer scan I see better what you are doing.
                However, my question/statement still stands.

                Just to make sure we are on the same page I am defining
                a Unicode document as one that uses two bytes to encode
                each character. Most of the time (for western documents),
                the high byte is zero.

                When I load, edit and save a Unicode document, it retains
                the dual byte format. The mapping is 100% to the ANSI
                character set but it still uses two bytes per character. This
                holds whether I turn Unicode protection off or if I leave it on
                and manually remove the read-only flag. Have you looked at
                your files with a binary viewer to verify a single or dual byte
                encoding format?

                Your clip maps characters with non-zero high bytes to the
                127 character Latin-1 set but still (I think) retains the dual
                byte encoding. Since the high bytes in my files are already
                all zeros, what I want to do is just delete the high byte.

                Since turning Unicode protection off didn't help me, I thought
                modifying your clip code would do the trick: simply replace
                every two-byte character with its low byte. But it doesn't
                work.

                FIND reported zero matches for all variations of "\x00[\x00-\xff]",
                "[\x00][\x00-\xff]" and "[\x00-\x00][\x00-\xff]" yet the source
                file strictly consisted of alternating zero and non-zero bytes.
                What are your NTB option settings and how do you load your
                files to be able to "see" the high byte?

                Unless I can come up with a better way I suppose this would work:

                Load the file,
                Capture the full file name
                Copy all to the clipboard,
                Close the file,
                Paste to a new doc
                Save the doc to the captured file name

                but it seems so inelegant and inefficient.
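
The layout Art describes, two bytes per character with a zero high byte, matches UTF-16 (little-endian on Windows). Under that assumption, dropping the high byte amounts to decoding UTF-16 and re-encoding as Windows-1252. A minimal Python sketch, outside NoteTab; `utf16_to_ansi` is an illustrative name, and falling back to little-endian for files without a byte order mark is an assumption:

```python
import codecs


def utf16_to_ansi(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as Windows-1252, one byte per character."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")      # the BOM selects the byte order and is dropped
    else:
        text = data.decode("utf-16-le")   # assumption: no BOM means little-endian
    # Characters with no Windows-1252 equivalent become decimal entities
    # instead of being silently lost.
    return text.encode("cp1252", errors="xmlcharrefreplace")
```

For a file whose high bytes are all zero this yields exactly the low bytes; anything else is kept as an `&#N;` entity, in the same spirit as Axel's clip.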

                Art

                At 1/13/2012 01:44 AM, Axel wrote:
                >Art Kocsis wrote:
                > > After a quick scan it looks like you still have a Unicode
                > > document, i.e., two bytes per character.
                >
                >No. I map everything to ANSI that can be converted (label latin1),
                >which is less than 128 characters. All the rest can't be mapped, so I
                >convert all that to HTML entities like &#8211;