
Re: [Clip] Trouble with UTF

  • Art Kocsis
    Message 1 of 12 , Jan 11, 2012
      The first thing I thought of is that there is a big difference between
      turning the protect option off and loading a document as Unicode and then
      turning read-only off. Even after editing a Unicode doc and saving it,
      the Unicode characters are still intact, but they would be lost if the doc
      were loaded as ANSI.

      BTW, Axel, will your clip be available after you get finished tweaking it?

      Art


      At 1/11/2012 11:20 AM, Axel wrote:
      >loro wrote:
      > > You can turn it off in Options, the General tab
      > > Protect Unicode Files:
      >
      >Thanks Loro, I had of course looked there but overlooked that (to me)
      >cryptic option. But what about
      >
      > > As a result, the conversion process may drop non-ANSI and
      > > cause the loss of information.

      <snip>

      >Axel
      >
      >P.S: I still want a warning message. Sometimes I have set files on disk
      >to read-only myself and don't think about it, and I hate it when
      >editing just refuses to work. Other programs allow full editing and only
      >remap "save" to "save as" in these cases. A much better solution IMHO.
    • Axel Berger
      Message 2 of 12 , Jan 11, 2012
        Art Kocsis wrote:
        > will your clip be available after you get finished tweaking it?

        There was a very silly typing mistake in the first draft that didn't
        come to light until I first came upon non-ANSI characters. This seems to
        work (many ^!Set lines are long):

        :loop
        ^!Find "[\xC0-\xF7][\x80-\xBF]*" RS
        ^!IfError donelatin
        ^!IfMatch "[\xC2-\xC3][\x80-\xBF]" "^$GetSelection$" latin1
        ^!IfMatch "[\xC0-\xDF][\x80-\xBF]" "^$GetSelection$" zwei
        ^!IfMatch "[\xE0-\xEF][\x80-\xBF]{2}" "^$GetSelection$" drei
        ^!IfMatch "[\xF0-\xF7][\x80-\xBF]{3}" "^$GetSelection$" vier
        ^!Continue Illegal sequence, can't be converted.
        ^!Goto loop
        :zwei
        ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
        ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 32)$
        ^!Set %third%=0
        ^!Set %fourth%=0
        ^!Goto makeent
        :drei
        ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
        ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
        ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 16)$
        ^!Set %fourth%=0
        ^!Goto makeent
        :vier
        ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";4)$)$ MOD 64)$
        ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
        ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
        ^!Set %fourth%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 8)$
        :makeent
        ^!Set %first%=^$Calc(262144*^%fourth%+4096*^%third%+64*^%second%+^%first%;0)$
        ^!InsertText &#^%first%;
        ^!Goto loop
        :latin1
        ^!Set %first%=^$StrCopyRight("^$GetSelection$";1)$
        ^!Set %second%=^$StrCopyLeft("^$GetSelection$";1)$
        ^!Set %first%=^$Calc(^$CharToDec(^%first%)$ MOD 64)$
        ^!Set %second%=^$Calc(^$CharToDec(^%second%)$ MOD 4)$
        ^!InsertText ^$DecToChar(^$Calc(64*^%second%+^%first%)$)$
        ^!Goto loop
        :donelatin
        ^!Replace "&#8364;" >> "€" WASTI
        ^!Replace "&#352;" >> "Š" WASTI
        ^!Replace "&#353;" >> "š" WASTI
        ^!Replace "&#381;" >> "Ž" WASTI
        ^!Replace "&#382;" >> "ž" WASTI
        ^!Replace "&#338;" >> "Œ" WASTI
        ^!Replace "&#339;" >> "œ" WASTI
        ^!Replace "&#376;" >> "Ÿ" WASTI
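
For readers without NoteTab, the same conversion can be sketched in Python. This is not the clip itself, only a minimal sketch under assumptions: the function names `code_point` and `utf8_to_ansi_with_entities` are made up for illustration, and the input is assumed to be well-formed UTF-8. `code_point` repeats the clip's MOD arithmetic by hand; the second function keeps Latin-1 characters plus the eight characters listed after `:donelatin` and turns everything else into a numeric entity.

```python
# The eight characters the clip handles after its main loop (:donelatin section).
ANSI_EXTRAS = "€ŠšŽžŒœŸ"


def code_point(seq: bytes) -> int:
    """Rebuild one code point from a 2-, 3- or 4-byte UTF-8 sequence.

    Lead byte: MOD 32, 16 or 8 depending on length.
    Each continuation byte: MOD 64, shifted in by multiplying by 64.
    """
    lead_modulus = {2: 32, 3: 16, 4: 8}[len(seq)]
    value = seq[0] % lead_modulus
    for byte in seq[1:]:
        value = value * 64 + byte % 64
    return value


def utf8_to_ansi_with_entities(raw: bytes) -> str:
    """Keep what the clip keeps as ANSI; emit the rest as &#N; entities."""
    out = []
    for ch in raw.decode("utf-8"):  # raises UnicodeDecodeError on illegal input
        if ord(ch) < 256 or ch in ANSI_EXTRAS:
            out.append(ch)
        else:
            out.append(f"&#{ord(ch)};")
    return "".join(out)
```

For example, `code_point(b"\xe2\x80\x93")` gives 8211, the en dash, which the second function would write as `&#8211;`.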

        It is advisable to check for legal UTF-8, i.e. no non-UTF 8-bit
        characters, first:

        :loop
        ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
        ^!IfError usasc
        ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
        ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
        ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
        ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
        ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
        ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
        ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
        ^!Continue Illegal sequence, no UTF-8
        ^!Goto loop
        :usasc
        ^!Continue No errors found
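
The same check can be sketched outside NoteTab. The Python below is only an illustration under assumptions, not a port that has been compared against the clip: it reuses the byte ranges from the `^!IfMatch` lines above, and the name `first_illegal_offset` is made up.

```python
import re
from typing import Optional

# Well-formed multi-byte UTF-8 sequences, using the same byte ranges as the clip.
WELL_FORMED = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
)

# Every run that starts with an 8-bit byte, like the clip's ^!Find pattern.
CANDIDATE = re.compile(rb"[\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*")


def first_illegal_offset(raw: bytes) -> Optional[int]:
    """Return the byte offset of the first illegal sequence, or None if all are legal."""
    for match in CANDIDATE.finditer(raw):
        if not WELL_FORMED.fullmatch(match.group()):
            return match.start()
    return None
```

A result of `None` corresponds to "No errors found"; any other value is the position of the first sequence that is not UTF-8.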

        Neither clip starts with a ^!Jump TEXT_START; both begin at the
        current cursor position. This is on purpose, but you might want to
        change it.

        Axel

        --
        Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
        Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
        D-51519 Odenthal-Heide eMail: Axel-Berger@...
        Deutschland (Germany) http://berger-odenthal.de
      • Art Kocsis
        Message 3 of 12 , Jan 12, 2012
          Axel,

          Thanks for the clip. It will be interesting to analyze it in detail which I
          will have time for later.
          After a quick scan it looks like you still have a Unicode document, i.e.,
          two bytes per character. For my Unicode source docs, such as Windows
          registry exports or web text, I just want to map into the ANSI character
          set and delete the upper byte. Setting NTB's option to not protect Unicode
          (as loro suggested) will probably work for me.

          Just in case you are not aware of it, Andrew West's Babelstone site
          [http://www.babelstone.co.uk/index.html] has an online and a freeware
          mapper for all the 110,116 (and counting!) Unicode chars, a freeware
          Unicode editor, a three hour(!) slide show of all the glyphs as well as
          LOTS of other interesting goodies. One could spend hours reading just one
          of his blogs, such as "Mani Stones in Many Scripts"
          [http://babelstone.blogspot.com/2006/11/mani-stones-in-many-scripts.html].
          The detail, depth and extent of the info presented is captivating and
          amazing (especially to anyone who has ever chanted the Sanskrit mantra: om
          mani padme hum).

          I also found it interesting to see your bilingual roots showing in your
          code as evidenced by the mixture of German and English labels. It works
          well. When I was in Germany a few years ago some of my (very poorly
          remembered) linguistic training came out as well. I found myself counting
          eins, zwei, drei, cuatro, cinco, seis! My HS & college teachers would be
          proud<g>. (Or turning over in their graves!)

          Art


          At 1/11/2012 11:08 PM, Axel wrote:
          >Art Kocsis wrote:
          > > will your clip be available after you get finished tweaking it?
          >
          >There was a very silly typing mistake in the first draft that didn't
          >come to light until I first came upon non-ANSI characters. This seems to
          >work (many ^!Set lines are long):
          <snip>
          >Axel
        • Axel Berger
          Message 4 of 12 , Jan 13, 2012
            Art Kocsis wrote:
            > After a quick scan it looks like you still have a Unicode
            > document, i.e., two bytes per character.

            No. I map everything to ANSI that can be converted (label latin-1),
            which is fewer than 128 characters. All the rest can't be mapped, so I
            convert all of that to HTML entities like &#8211;

            Axel
          • Art Kocsis
            Message 5 of 12 , Jan 13, 2012
              OK, after a longer scan I see better what you are doing.
              However, my question/statement still stands.

              Just to make sure we are on the same page I am defining
              a Unicode document as one that uses two bytes to encode
              each character. Most of the time (for western documents),
              the high byte is zero.

              When I load, edit and save a Unicode document, it retains
              the dual byte format. The mapping is 100% to the ANSI
              character set but it still uses two bytes per character. This
              holds whether I turn Unicode protection off or if I leave it on
              and manually remove the read-only flag. Have you looked at
              your files with a binary viewer to verify a single or dual byte
              encoding format?
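
If no binary viewer is at hand, the first bytes of the file can also be checked with a few lines of Python. This is a rough sketch, not a reliable detector: the name `guess_encoding` and the returned labels are made up, and the "alternating zero bytes" test is only a heuristic that assumes western text.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Guess single-byte vs. two-byte encoding from a BOM or from zero bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8 (with BOM)"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM: western text stored as two bytes per character has a zero
    # in every second byte.
    if len(raw) >= 2 and len(raw) % 2 == 0:
        if all(b == 0 for b in raw[1::2]):
            return "utf-16-le (no BOM, guessed)"
        if all(b == 0 for b in raw[0::2]):
            return "utf-16-be (no BOM, guessed)"
    return "single-byte or UTF-8 (no BOM)"
```

Calling it on the result of `open(path, "rb").read()` for a saved file shows whether the two-byte format survived the save.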

              Your clip maps characters with non-zero high bytes to the
              127 character Latin-1 set but still (I think) retains the dual
              byte encoding. Since the high bytes in my files are already
              all zeros, what I want to do is just delete them.

              Since turning Unicode protection off didn't help me, I thought
              modifying your clip code would do the trick: simply replace
              every two-byte character with its low byte. But it doesn't
              work.

              FIND reported zero matches for all variations of "\x00[\x00-\xff]",
              "[\x00][\x00-\xff]" and "[\x00-\x00][\x00-\xff]" yet the source
              file strictly consisted of alternating zero and non-zero bytes.
              What are your NTB option settings and how do you load your
              files to be able to "see" the high byte?

              Unless I can come up with a better way I suppose this would work:

              Load the file,
              Capture the full file name
              Copy all to the clipboard,
              Close the file,
              Paste to a new doc
              Save the doc to the captured file name

              but it seems so inelegant and inefficient.
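
Outside NoteTab, dropping the high bytes amounts to re-encoding the file. The following Python is a minimal sketch under assumptions: the file starts with a UTF-16 byte order mark (without one, Python falls back to a default byte order), the target is Windows-1252 as the ANSI code page, and the function names are made up for illustration.

```python
from pathlib import Path


def utf16_bytes_to_ansi(raw: bytes, errors: str = "strict") -> bytes:
    """Decode two-byte Unicode and re-encode as single-byte Windows-1252.

    errors="strict" raises if a character has no ANSI equivalent;
    errors="xmlcharrefreplace" writes such characters as &#N; entities.
    """
    return raw.decode("utf-16").encode("cp1252", errors=errors)


def convert_file_in_place(path: str, errors: str = "strict") -> None:
    """Overwrite the file with its single-byte version (keep a backup first)."""
    p = Path(path)
    p.write_bytes(utf16_bytes_to_ansi(p.read_bytes(), errors))
```

Working on raw bytes leaves line endings untouched. With `errors="xmlcharrefreplace"`, characters outside the ANSI set come out as numeric HTML entities.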

              Art

              At 1/13/2012 01:44 AM, Axel wrote:
              >Art Kocsis wrote:
              > > After a quick scan it looks like you still have a Unicode
              > > document, i.e., two bytes per character.
              >
              >No. I map everything to ANSI that can be converted (label latin-1),
              >which is fewer than 128 characters. All the rest can't be mapped, so I
              >convert all of that to HTML entities like &#8211;