Re: [Clip] Trouble with UTF

  • Art Kocsis
    Message 1 of 12 , Jan 8 1:43 PM
      Axel,

      A couple of suggestions:

      You could create your own alert box.
      Insert a test line at the beginning of the document and then check
      for error condition or test for its existence, then pop up an alert.

      Using the results of the above test, toggle the read only flag for the
      document: ^!Menu/Document/Read-Only.
      Unfortunately, it is a toggle so you have to test first.

      Instead of opening the Unicode document, load its ANSI text into
      a new tab with ^$GetUnicodeFileText("FileName")$.

      However, this may not work for you as it may destroy the non-ANSI
      characters that you converted to the &#8211; form.

      The Unicode problem is a big PITA as more and more docs appear
      in Unicode. NTB really needs a built-in conversion command but I
      am not holding my breath. Updates are few and far between and then
      only minor.

      Art


      At 1/8/2012 12:45 PM, Axel wrote:
      >I'm having serious trouble with the so-called UTF-8 capabilities of
      >NoteTab, so bad, I might even have to go back to version 5.8.
      >
      >I've got a working clip to convert all legal UTF-8 to either ANSI or
      >entities like &#8211;. Trouble is, unless I jump through hoops it can't
      >work, as files are loaded read-only.
      >N.B.: Whenever that happens I want a really big alert box! All too often
      >have I juggled with a clip that just would not work and not seen that
      >tiny "read only" at the bottom of the window.
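The conversion Axel describes (legal UTF-8 mapped to ANSI where possible, numeric entities otherwise) can be sketched outside NoteTab. A minimal Python version, assuming the &#NNNN; entity form rather than anything from his actual clip:

```python
# Sketch (an assumption, not Axel's actual clip): decode legal UTF-8,
# keep characters that fit the single-byte ANSI range, and turn
# everything else into &#NNNN; numeric character entities.
def utf8_to_ansi_or_entities(raw: bytes) -> str:
    out = []
    for ch in raw.decode("utf-8"):  # raises UnicodeDecodeError on illegal UTF-8
        out.append(ch if ord(ch) < 256 else "&#%d;" % ord(ch))
    return "".join(out)

# utf8_to_ansi_or_entities("Straße – 10€".encode("utf-8"))
# -> "Straße &#8211; 10&#8364;"
```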
    • bruce.somers@web.de
      Message 2 of 12 , Jan 8 2:20 PM
        > The Unicode problem is a big PITA as more and more docs appear in Unicode. NTB really needs a built-in conversion command but I
        > am not holding my breath. Updates are few and far between and then only minor.

        Is NoteTab now orphaned software? Is it still being maintained? Sad, if it is not.

        Bruce
      • Axel Berger
        Message 3 of 12 , Jan 9 12:28 AM
          Art Kocsis wrote:
          > ^!Menu/Document/Read-Only.

          Thanks! I had looked in vain for just that all over the place and failed
          to find it. Failing to be able to turn read-only off I had to resort to
          complicated gymnastics using load as. This will really make life easier.
          Why, oh why, is it not in the document properties box?

          Your other two hints are impractical though. The thing is, I'm surprised
          by the read-only where I did not expect it, as when saving a whole web
          page and opening it through a doubleclick.

          > NTB really needs a built-in conversion command

          My simple clip suffices. As long as NoteTab can't really work with
          characters of more than one byte and doesn't offer full UTF capability,
          it shouldn't pretend to do so. UTF opened as is may look strange but can
          be worked on, which is the only important thing.

          Axel

          --
          Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
          Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
          D-51519 Odenthal-Heide eMail: Axel-Berger@...
          Deutschland (Germany) http://berger-odenthal.de
        • loro
          Message 4 of 12 , Jan 11 11:06 AM
            At 09:28 2012-01-09, Axel Berger wrote:
            >Your other two hints are impractical though. The thing is, I'm surprised
            >by the read-only where I did not expect it, as when saving a whole web
            >page and opening it through a doubleclick.

            You can turn it off in Options, the General tab

            -----
            Protect Unicode Files: When NoteTab opens Unicode files, it has to
            convert them to the ANSI character set, which handles much fewer
            characters. As a result, the conversion process may drop non-ANSI and
            cause the loss of information. When this setting is checked, Unicode
            files are opened in Read-Only mode, which protects the file from
            changes. If you know that your Unicode documents will not lose
            important information during the conversion process, you can uncheck
            this option in order to open such files in editable mode
            -----

            Lotta
          • Axel Berger
            Message 5 of 12 , Jan 11 11:20 AM
              loro wrote:
              > You can turn it off in Options, the General tab
              > Protect Unicode Files:

              Thanks Loro, I had of course looked there but overlooked that (to me)
              cryptic option. But what about

              > As a result, the conversion process may drop non-ANSI and
              > cause the loss of information.

              I want no conversion and opening files as-is, as the "UTF-8 (no
              conversion)" setting in the open dialog would do, and do all conversions
              myself. At the very least I do not want to simply lose something.

              Danke
              Axel

              P.S.: I still want a warning message. Sometimes I have set files on disk
              to read-only myself and don't think about it, and I hate it when
              editing just refuses to work. Other programs allow full editing and only
              remap "save" to "save as" in these cases. A much better solution IMHO.

              --
              Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
              Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
              D-51519 Odenthal-Heide eMail: Axel-Berger@...
              Deutschland (Germany) http://berger-odenthal.de
            • Art Kocsis
              Message 6 of 12 , Jan 11 3:38 PM
                The first thing I thought of is that there is a big difference between
                turning the protect option off vs. loading a document as Unicode and then
                turning the read-only off. Even after editing a Unicode doc and saving it,
                the Unicode characters are still intact but they would be lost if the doc
                was loaded as ANSI.

                BTW, Axel, will your clip be available after you get finished tweaking it?

                Art


                At 1/11/2012 11:20 AM, Axel wrote:
                >loro wrote:
                > > You can turn it off in Options, the General tab
                > > Protect Unicode Files:
                >
                >Thanks Loro, I had of course looked there but overlooked that (to me)
                >cryptic option. But what about
                >
                > > As a result, the conversion process may drop non-ANSI and
                > > cause the loss of information.

                <snip>

                >Axel
                >
                >P.S: I still want a warning message. Sometimes I have set files on disk
                >to read-only myself and don't think about it, and i hate it, when
                >editing just refuses to work. Other programs allow full editing and only
                >remap "save" to "save as" in these cases. A much better solution IMHO.
              • Axel Berger
                Message 7 of 12 , Jan 11 11:08 PM
                  Art Kocsis wrote:
                  > will your clip be available after you get finished tweaking it?

                  There was a very silly typing mistake in the first draft that didn't
                  come to light until I first came upon non-ANSI characters. This seems to
                  work (many ^!Set lines are long):

                  :loop
                  ^!Find "[\xC0-\xF7][\x80-\xBF]*" RS
                  ^!IfError donelatin
                  ^!IfMatch "[\xC2-\xC3][\x80-\xBF]" "^$GetSelection$" latin1
                  ^!IfMatch "[\xC0-\xDF][\x80-\xBF]" "^$GetSelection$" zwei
                  ^!IfMatch "[\xE0-\xEF][\x80-\xBF]{2}" "^$GetSelection$" drei
                  ^!IfMatch "[\xF0-\xF7][\x80-\xBF]{3}" "^$GetSelection$" vier
                  ^!Continue Illegal sequence, can't be converted.
                  ^!Goto loop
                  :zwei
                  ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                  ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 32)$
                  ^!Set %third%=0
                  ^!Set %fourth%=0
                  ^!Goto makeent
                  :drei
                  ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
                  ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                  ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 16)$
                  ^!Set %fourth%=0
                  ^!Goto makeent
                  :vier
                  ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";4)$)$ MOD 64)$
                  ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
                  ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                  ^!Set %fourth%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 8)$
                  :makeent
                  ^!Set %first%=^$Calc(262144*^%fourth%+4096*^%third%+64*^%second%+^%first%;0)$
                  ^!InsertText &#^%first%;
                  ^!Goto loop
                  :latin1
                  ^!Set %first%=^$StrCopyRight("^$GetSelection$";1)$
                  ^!Set %second%=^$StrCopyLeft("^$GetSelection$";1)$
                  ^!Set %first%=^$Calc(^$CharToDec(^%first%)$ MOD 64)$
                  ^!Set %second%=^$Calc(^$CharToDec(^%second%)$ MOD 4)$
                  ^!InsertText ^$DecToChar(^$Calc(64*^%second%+^%first%)$)$
                  ^!Goto loop
                  :donelatin
                  ^!Replace "€" >> "&#8364;" WASTI
                  ^!Replace "Š" >> "&#352;" WASTI
                  ^!Replace "š" >> "&#353;" WASTI
                  ^!Replace "Ž" >> "&#381;" WASTI
                  ^!Replace "ž" >> "&#382;" WASTI
                  ^!Replace "Œ" >> "&#338;" WASTI
                  ^!Replace "œ" >> "&#339;" WASTI
                  ^!Replace "Ÿ" >> "&#376;" WASTI
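The :zwei/:drei/:vier branches above rebuild the code point from 6-bit continuation payloads (MOD 64) plus the lead byte's 5, 4 or 3 payload bits (MOD 32, 16 or 8). A Python sketch of the same arithmetic, for readers unfamiliar with clip syntax (the function name is mine, not from the clip):

```python
# Sketch of the clip's code-point arithmetic: each continuation byte
# contributes 6 bits, the lead byte 5/4/3 bits depending on length.
def utf8_seq_to_codepoint(seq: bytes) -> int:
    """Decode one well-formed UTF-8 sequence of 2-4 bytes."""
    if len(seq) == 2:
        first, second = seq[1] % 64, seq[0] % 32
        third = fourth = 0
    elif len(seq) == 3:
        first, second, third = seq[2] % 64, seq[1] % 64, seq[0] % 16
        fourth = 0
    else:  # 4 bytes
        first, second = seq[3] % 64, seq[2] % 64
        third, fourth = seq[1] % 64, seq[0] % 8
    # mirrors the clip: 262144*fourth + 4096*third + 64*second + first
    return 262144 * fourth + 4096 * third + 64 * second + first

# An en dash (U+2013) is E2 80 93 in UTF-8, hence the &#8211; entity:
# utf8_seq_to_codepoint(b"\xE2\x80\x93") -> 8211
```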

                  It is advisable to check for legal UTF-8, i.e. no non-UTF 8-bit
                  characters, first:

                  :loop
                  ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
                  ^!IfError usasc
                  ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
                  ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
                  ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
                  ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
                  ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
                  ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
                  ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
                  ^!Continue Illegal sequence, no UTF-8
                  ^!Goto loop
                  :usasc
                  ^!Continue No errors found
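The ranges tested by the ^!IfMatch lines are exactly the well-formed UTF-8 byte patterns (overlong forms, UTF-16 surrogates, and values above U+10FFFF all fall outside them). The same check can be sketched in Python as a single bytes regex; this is an illustration, not part of the clip:

```python
import re

# Well-formed UTF-8 patterns, matching the clip's ^!IfMatch ranges.
# Overlongs (C0/C1, E0 80-9F, F0 80-8F), surrogates (ED A0-BF) and
# code points above U+10FFFF (F4 90+, F5-FF) are all rejected.
WELL_FORMED_UTF8 = re.compile(
    rb"(?:[\x00-\x7F]"                      # ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2-byte
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3-byte, no overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3-byte, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4-byte, no overlongs
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4-byte
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})*"    # 4-byte, <= U+10FFFF
)

def is_well_formed_utf8(data: bytes) -> bool:
    return WELL_FORMED_UTF8.fullmatch(data) is not None
```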

                  Both clips do not start with a ^!Jump TEXT_START and begin at the
                  current cursor position. This is on purpose, but you might want to
                  change it.

                  Axel

                  --
                  Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
                  Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
                  D-51519 Odenthal-Heide eMail: Axel-Berger@...
                  Deutschland (Germany) http://berger-odenthal.de
                • Art Kocsis
                  Message 8 of 12 , Jan 12 4:18 PM
                    Axel,

                    Thanks for the clip. It will be interesting to analyze it in detail,
                    which I will have time for later.
                    After a quick scan it looks like you still have a Unicode document, i.e.,
                    two bytes per character. My Unicode source docs, such as Windows
                    registry exports or web text, I just want to map into the ANSI character
                    set and delete the upper byte. Setting NTB's option to not protect Unicode
                    (as loro suggested) will probably work for me.

                    Just in case you are not aware of it, Andrew West's Babelstone site
                    [http://www.babelstone.co.uk/index.html] has an online and a freeware
                    mapper for all the 110,116 (and counting!) Unicode chars, a freeware
                    Unicode editor, a three hour(!) slide show of all the glyphs as well as
                    LOTS of other interesting goodies. One could spend hours reading just one
                    of his blogs, such as "Mani Stones in Many Scripts"
                    [http://babelstone.blogspot.com/2006/11/mani-stones-in-many-scripts.html].
                    The detail, depth and extent of the info presented is captivating and
                    amazing (especially to anyone who has ever chanted the Sanskrit mantra: om
                    mani padme hum).

                    I also found it interesting to see your bilingual roots showing in your
                    code as evidenced by the mixture of German and English labels. It works
                    well. When I was in Germany a few years ago some of my (very poorly
                    remembered) linguistic training came out as well. I found myself counting
                    eins, zwei, drei, cuatro, cinco, seis! My HS & college teachers would be
                    proud<g>. (Or turning over in their graves!)

                    Art


                    At 1/11/2012 11:08 PM, Axel wrote:
                    >Art Kocsis wrote:
                    > > will your clip be available after you get finished tweaking it?
                    >
                    >There was a very silly typing mistake in the first draft that didn't
                    >come to light until I first came upon non-ANSI characters. This seems to
                    >work (many ^!Set lines are long):
                    <snip>
                    >Axel
                  • Axel Berger
                    Message 9 of 12 , Jan 13 1:44 AM
                      Art Kocsis wrote:
                      > After a quick scan it looks like you still have a Unicode
                      > document, i.e., two bytes per character.

                      No. I map everything to ANSI that can be converted (label latin-1),
                      which is less than 128 characters. All the rest can't be mapped, so I
                      convert all that to HTML entities like &#8211;

                      Axel
                    • Art Kocsis
                      Message 10 of 12 , Jan 13 12:59 PM
                        OK, after a longer scan I see better what you are doing.
                        However, my question/statement still stands.

                        Just to make sure we are on the same page I am defining
                        a Unicode document as one that uses two bytes to encode
                        each character. Most of the time (for western documents),
                        the high byte is zero.

                        When I load, edit and save a Unicode document, it retains
                        the dual byte format. The mapping is 100% to the ANSI
                        character set but it still uses two bytes per character. This
                        holds whether I turn Unicode protection off or if I leave it on
                        and manually remove the read-only flag. Have you looked at
                        your files with a binary viewer to verify a single or dual byte
                        encoding format?

                        Your clip maps characters with non-zero high bytes to the
                        127-character Latin-1 set but still (I think) retains the
                        dual-byte encoding. Since the high bytes in my files are
                        already all zeros, what I want to do is just delete the high byte.

                        Since turning Unicode protection off didn't help me, I thought
                        modifying your clip code would do the trick: simply replace
                        every dual-byte character with its low byte, but it doesn't
                        work.

                        FIND reported zero matches for all variations of "\x00[\x00-\xff]",
                        "[\x00][\x00-\xff]" and "[\x00-\x00][\x00-\xff]" yet the source
                        file strictly consisted of alternating zero and non-zero bytes.
                        What are your NTB option settings and how do you load your
                        files to be able to "see" the high byte?

                        Unless I can come up with a better way I suppose this would work:

                        Load the file,
                        Capture the full file name
                        Copy all to the clipboard,
                        Close the file,
                        Paste to a new doc
                        Save the doc to the captured file name

                        but it seems so inelegant and inefficient.
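What Art describes sounds like UTF-16LE: two bytes per character, low byte first, high byte zero for ANSI text. Under that assumption (the function and its error handling are mine, not from the thread), deleting the high bytes is just taking every second byte:

```python
# Sketch (assumption: the "Unicode" files are UTF-16LE, the usual
# Windows two-bytes-per-char format). If every high byte is zero,
# dropping the high bytes is slicing out every second byte, after
# skipping the 2-byte BOM (FF FE) if present.
def strip_high_bytes(data: bytes) -> bytes:
    if data[:2] == b"\xff\xfe":  # UTF-16LE byte-order mark
        data = data[2:]
    low = data[0::2]             # low (first) byte of each pair
    high = data[1::2]
    if any(high):                # non-zero high byte: not plain ANSI
        raise ValueError("file contains characters outside the low byte range")
    return low

# strip_high_bytes(b"\xff\xfeA\x00B\x00") -> b"AB"
```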

                        Art

                        At 1/13/2012 01:44 AM, Axel wrote:
                        >Art Kocsis wrote:
                        > > After a quick scan it looks like you still have a Unicode
                        > > document, i.e., two bytes per character.
                        >
                        >No. I map everything to ANSI that can be converted (label latin-1),
                        >which is less than 128 characters. All the rest can't be mapped, so I
                        >convert all that to HTML entities like &#8211;