Trouble with UTF

  • Axel Berger
    Message 1 of 12, Jan 8, 2012
      I'm having serious trouble with the so-called UTF-8 capabilities of
      NoteTab, so bad, I might even have to go back to version 5.8.

      I've got a working clip to convert all legal UTF-8 to either ANSI or
      entities like &#8211;. Trouble is, unless I jump through hoops it can't
      work, as files are loaded read-only.
      N.B: Whenever that happens I want a really big alert box! All too often
      have I juggled with a clip that just would not work and not seen that
      tiny "read only" at the bottom of the window.

      After conversion and after saving and reloading things still won't work.
      The file is now clean and pure ANSI but somewhere somehow NoteTab seems
      to remember it used to be UTF-8. I have to tell NoteTab explicitly to
      load it as ANSI for stuff to work again. With old, proven and reliable
      clips I don't usually check but just move on, only to find much later,
      that nothing was done.

      This is just not good enough. I really need to be able to turn those
      dysfunctional UTF capabilities off.

      Axel

      --
      Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
      Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
      D-51519 Odenthal-Heide eMail: Axel-Berger@...
      Deutschland (Germany) http://berger-odenthal.de
    • John Shotsky
      Message 2 of 12, Jan 8, 2012
        I think if you copy your modified doc and paste it into a new doc and save, all will be fine (assuming you have new docs
        set for DOS ASCII). Save and Save As seem to have some residual memory of what used to be: NoteTab knows what format
        the file was opened as and, generally speaking, will save in that format. I have built that right into my clip handling,
        so I don't think about it anymore. You can get some strange characters from email or the web.

        Regards,
        John
        RecipeTools Web Site: http://recipetools.gotdns.com/

      • Art Kocsis
        Message 3 of 12, Jan 8, 2012
          Axel,

          A couple of suggestions:

          You could create your own alert box.
          Insert a test line at the beginning of the document and then check
          for error condition or test for its existence, then pop up an alert.

          Using the results of the above test, toggle the read only flag for the
          document: ^!Menu/Document/Read-Only.
          Unfortunately, it is a toggle so you have to test first.

          Instead of opening the Unicode document, load its ANSI text into
          a new tab with ^$GetUnicodeFileText("FileName")$.

          However, this may not work for you as it may destroy the non-ANSI
          characters that you converted to the &#8211; form.

          The Unicode problem is a big PITA as more and more docs appear
          in Unicode. NTB really needs a built-in conversion command but I
          am not holding my breath. Updates are few and far between and then
          only minor.

          Art


        • bruce.somers@web.de
          Message 4 of 12, Jan 8, 2012
            > The Unicode problem is a big PITA as more and more docs appear in Unicode. NTB really needs a built-in conversion command but I
            > am not holding my breath. Updates are few and far between and then only minor.

            Is NoteTab now orphaned software? Is it still being maintained? Sad, if it is not.

            Bruce
          • Axel Berger
            Message 5 of 12, Jan 9, 2012
              Art Kocsis wrote:
              > ^!Menu/Document/Read-Only.

              Thanks! I had looked in vain for just that all over the place and failed
              to find it. Failing to be able to turn read-only off I had to resort to
              complicated gymnastics using load as. This will really make life easier.
              Why, oh why, is it not in the document properties box?

              Your other two hints are impractical though. The thing is, I'm surprised
              by the read-only where I did not expect it, as when saving a whole web
              page and opening it through a doubleclick.

              > NTB really needs a built-in conversion command

              My simple clip suffices. As long as NoteTab can't really work with
              characters of more than one byte and doesn't offer full UTF capability,
              it shouldn't pretend to do so. UTF opened as is may look strange but can
              be worked on, which is the only important thing.

              Axel

            • loro
              Message 6 of 12, Jan 11, 2012
                At 09:28 2012-01-09, Axel Berger wrote:
                >Your other two hints are impractical though. The thing is, I'm surprised
                >by the read-only where I did not expect it, as when saving a whole web
                >page and opening it through a doubleclick.

                You can turn it off in Options, the General tab

                -----
                Protect Unicode Files: When NoteTab opens Unicode files, it has to
                convert them to the ANSI character set, which handles much fewer
                characters. As a result, the conversion process may drop non-ANSI and
                cause the loss of information. When this setting is checked, Unicode
                files are opened in Read-Only mode, which protects the file from
                changes. If you know that your Unicode documents will not lose
                important information during the conversion process, you can uncheck
                this option in order to open such files in editable mode.
                -----
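
The kind of loss that help text warns about can be illustrated outside NoteTab. The following is a minimal Python sketch, not NoteTab's actual conversion: it assumes "ANSI" means Windows code page 1252 and that unrepresentable characters are replaced with "?" (NoteTab may drop or replace them differently). The function names and the sample string are hypothetical.

```python
def to_ansi_lossy(text):
    """Encode to CP1252, replacing anything unrepresentable with '?'.

    Assumption: CP1252 stands in for "ANSI"; the '?' replacement is
    Python's behaviour, not necessarily what NoteTab does.
    """
    return text.encode("cp1252", errors="replace")


def unrepresentable(text):
    """Return the characters CP1252 cannot hold, in order of first appearance."""
    lost = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            if ch not in lost:
                lost.append(ch)
    return lost


# Hypothetical sample: a-umlaut and the en dash exist in CP1252, omega does not.
sample = "H\u00e4ck \u2013 \u03c9"
lossy = to_ansi_lossy(sample)
lost = unrepresentable(sample)
```

Running `unrepresentable` over a document before unchecking the option would show whether anything is at risk at all.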

                Lotta
              • Axel Berger
                Message 7 of 12, Jan 11, 2012
                  loro wrote:
                  > You can turn it off in Options, the General tab
                  > Protect Unicode Files:

                  Thanks Loro, I had of course looked there but overlooked that (to me)
                  cryptic option. But what about

                  > As a result, the conversion process may drop non-ANSI and
                  > cause the loss of information.

                  I want no conversion and opening files as-is, as the "UTF-8 (no
                  conversion)" setting in the open dialog would do, and do all conversions
                  myself. At the very least I do not want to simply lose something.

                  Danke
                  Axel

                  P.S: I still want a warning message. Sometimes I have set files on disk
                  to read-only myself and don't think about it, and I hate it when
                  editing just refuses to work. Other programs allow full editing and only
                  remap "save" to "save as" in these cases. A much better solution IMHO.

                • Art Kocsis
                  Message 8 of 12, Jan 11, 2012
                    The first thing I thought of is that there is a big difference between
                    turning the protect option off vs loading a document as Unicode and then
                    turning the read-only off. Even after editing a Unicode doc and saving it,
                    the Unicode characters are still intact but they would be lost if the doc
                    was loaded as ANSI.

                    BTW, Axel, will your clip be available after you get finished tweaking it?

                    Art


                  • Axel Berger
                    Message 9 of 12, Jan 11, 2012
                      Art Kocsis wrote:
                      > will your clip be available after you get finished tweaking it?

                      There was a very silly typing mistake in the first draft that didn't
                      come to light until I first came upon non-ANSI characters. This seems to
                      work (many ^!Set lines are long):

                      :loop
                      ^!Find "[\xC0-\xF7][\x80-\xBF]*" RS
                      ^!IfError donelatin
                      ^!IfMatch "[\xC2-\xC3][\x80-\xBF]" "^$GetSelection$" latin1
                      ^!IfMatch "[\xC0-\xDF][\x80-\xBF]" "^$GetSelection$" zwei
                      ^!IfMatch "[\xE0-\xEF][\x80-\xBF]{2}" "^$GetSelection$" drei
                      ^!IfMatch "[\xF0-\xF7][\x80-\xBF]{3}" "^$GetSelection$" vier
                      ^!Continue Illegal sequence, can't be converted.
                      ^!Goto loop
                      :zwei
                      ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                      ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 32)$
                      ^!Set %third%=0
                      ^!Set %fourth%=0
                      ^!Goto makeent
                      :drei
                      ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
                      ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                      ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 16)$
                      ^!Set %fourth%=0
                      ^!Goto makeent
                      :vier
                      ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";4)$)$ MOD 64)$
                      ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
                      ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                      ^!Set %fourth%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 8)$
                      :makeent
                      ^!Set %first%=^$Calc(262144*^%fourth%+4096*^%third%+64*^%second%+^%first%;0)$
                      ^!InsertText &#^%first%;
                      ^!Goto loop
                      :latin1
                      ^!Set %first%=^$StrCopyRight("^$GetSelection$";1)$
                      ^!Set %second%=^$StrCopyLeft("^$GetSelection$";1)$
                      ^!Set %first%=^$Calc(^$CharToDec(^%first%)$ MOD 64)$
                      ^!Set %second%=^$Calc(^$CharToDec(^%second%)$ MOD 4)$
                      ^!InsertText ^$DecToChar(^$Calc(64*^%second%+^%first%)$)$
                      ^!Goto loop
                      :donelatin
                      ^!Replace "&#8364;" >> "€" WASTI
                      ^!Replace "&#352;" >> "Š" WASTI
                      ^!Replace "&#353;" >> "š" WASTI
                      ^!Replace "&#381;" >> "Ž" WASTI
                      ^!Replace "&#382;" >> "ž" WASTI
                      ^!Replace "&#338;" >> "Œ" WASTI
                      ^!Replace "&#339;" >> "œ" WASTI
                      ^!Replace "&#376;" >> "Ÿ" WASTI
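
For readers who want to check the arithmetic outside NoteTab, here is a minimal Python sketch of the same MOD-based decoding. It is an illustration of the technique, not a translation of the clip: the function names `decode_sequence` and `convert` are made up for this example, it handles one byte sequence at a time, and it leaves out the final replacement step for the eight Windows-1252 extras.

```python
def decode_sequence(seq):
    """Return the code point of one 2-, 3- or 4-byte UTF-8 sequence.

    Uses the same arithmetic as the clip: the lead byte contributes
    (byte MOD 32), (byte MOD 16) or (byte MOD 8), and every
    continuation byte contributes (byte MOD 64).
    """
    lead, tail = seq[0], seq[1:]
    if 0xC0 <= lead <= 0xDF and len(tail) == 1:      # label "zwei"
        value = lead % 32
    elif 0xE0 <= lead <= 0xEF and len(tail) == 2:    # label "drei"
        value = lead % 16
    elif 0xF0 <= lead <= 0xF7 and len(tail) == 3:    # label "vier"
        value = lead % 8
    else:
        raise ValueError("illegal sequence, can't be converted")
    for byte in tail:
        if not 0x80 <= byte <= 0xBF:
            raise ValueError("illegal continuation byte")
        value = value * 64 + byte % 64
    return value


def convert(seq):
    """Return a Latin-1 character where possible, else a numeric entity."""
    code_point = decode_sequence(seq)
    if seq[0] in (0xC2, 0xC3):                       # label "latin1"
        return chr(code_point)
    return "&#%d;" % code_point
```

For example, the en dash is stored as the bytes E2 80 93, which works out as (2 * 64 + 0) * 64 + 19 = 8211, so `convert(b"\xe2\x80\x93")` gives `&#8211;`.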

                      It is advisable to check for legal UTF-8, i.e. no non-UTF 8-bit
                      characters, first:

                      :loop
                      ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
                      ^!IfError usasc
                      ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
                      ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
                      ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
                      ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
                      ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
                      ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
                      ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
                      ^!Continue Illegal sequence, no UTF-8
                      ^!Goto loop
                      :usasc
                      ^!Continue No errors found
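
The same legality check can be sketched in Python, whose strict UTF-8 decoder rejects the cases the patterns above exclude (stray continuation bytes, overlong forms, surrogates, values above U+10FFFF). This is a sketch for comparison only; `first_invalid_utf8` is a hypothetical helper name, not part of any NoteTab feature.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first illegal sequence, or None if clean."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None
```

A file that is really Latin-1 fails quickly: in `b"H\xe4ck"` the byte E4 announces a three-byte sequence but is followed by plain ASCII, so the function reports offset 1.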

                      Neither clip starts with a ^!Jump TEXT_START; both begin at the
                      current cursor position. This is on purpose, but you might want to
                      change it.

                      Axel

                    • Art Kocsis
                      Message 10 of 12, Jan 12, 2012
                        Axel,

                        Thanks for the clip. It will be interesting to analyze it in detail which I
                        will have time for later.
                        After a quick scan it looks like you still have a Unicode document, i.e.,
                        two bytes per character. With my Unicode source docs, such as Windows
                        registry exports or web text, I just want to map into the ANSI character
                        set and delete the upper byte. Setting NTB's option to not protect Unicode
                        (as loro suggested) will probably work for me.

                        Just in case you are not aware of it, Andrew West's Babelstone site
                        [http://www.babelstone.co.uk/index.html] has an online and a freeware
                        mapper for all the 110,116 (and counting!) Unicode chars, a freeware
                        Unicode editor, a three hour(!) slide show of all the glyphs as well as
                        LOTS of other interesting goodies. One could spend hours reading just one
                        of his blogs, such as "Mani Stones in Many Scripts"
                        [http://babelstone.blogspot.com/2006/11/mani-stones-in-many-scripts.html].
                        The detail, depth and extent of the info presented is captivating and
                        amazing (especially to anyone who has ever chanted the Sanskrit mantra: om
                        mani padme hum).

                        I also found it interesting to see your bilingual roots showing in your
                        code as evidenced by the mixture of German and English labels. It works
                        well. When I was in Germany a few years ago some of my (very poorly
                        remembered) linguistic training came out as well. I found myself counting
                        eins, zwei, drei, cuatro, cinco, seis! My HS & college teachers would be
                        proud<g>. (Or turning over in their graves!)

                        Art


                      • Axel Berger
                        Message 11 of 12, Jan 13, 2012
                          Art Kocsis wrote:
                          > After a quick scan it looks like you still have a Unicode
                          > document, i.e., two bytes per character.

                          No. I map everything to ANSI that can be converted (label latin1),
                          which is less than 128 characters. All the rest can't be mapped so I
                          convert all that to HTML entities like &#8211;

                          Axel
                        • Art Kocsis
                          Message 12 of 12, Jan 13, 2012
                            OK, after a longer scan I see better what you are doing.
                            However, my question/statement still stands.

                            Just to make sure we are on the same page I am defining
                            a Unicode document as one that uses two bytes to encode
                            each character. Most of the time (for western documents),
                            the high byte is zero.

                            When I load, edit and save a Unicode document, it retains
                            the dual byte format. The mapping is 100% to the ANSI
                            character set but it still uses two bytes per character. This
                            holds whether I turn Unicode protection off or if I leave it on
                            and manually remove the read-only flag. Have you looked at
                            your files with a binary viewer to verify a single or dual byte
                            encoding format?
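
Short of a binary viewer, a few lines of Python can answer the single-or-dual-byte question. This is a rough sketch under stated assumptions: `sniff_encoding` is a hypothetical helper, it only looks at the byte order mark and at whether every second byte is zero, and it does not try to tell UTF-32 apart from UTF-16.

```python
import codecs


def sniff_encoding(data):
    """Guess how a file's bytes are encoded from its BOM or byte pattern."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM: alternating non-zero/zero bytes suggest two-byte text
    # whose high bytes are all zero.
    if len(data) >= 2 and len(data) % 2 == 0 and all(b == 0 for b in data[1::2]):
        return "probably utf-16-le without BOM"
    return "single-byte or UTF-8 without BOM"
```

Feeding it the first few hundred bytes of a file read in binary mode is enough to tell a two-byte document from an ANSI or UTF-8 one.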

                            Your clip maps characters with non-zero high bytes to the
                            127 character Latin-1 set but still (I think) retains the dual
                            byte encoding. Since the high bytes in my files are already
                            all zeros, what I want to do is just delete the high byte.

                            Since turning Unicode protection off didn't help me, I thought
                            modifying your clip code would do the trick: simply replace
                            every two-byte character with its low byte contents, but it
                            doesn't work.

                            FIND reported zero matches for all variations of "\x00[\x00-\xff]",
                            "[\x00][\x00-\xff]" and "[\x00-\x00][\x00-\xff]" yet the source
                            file strictly consisted of alternating zero and non-zero bytes.
                            What are your NTB option settings and how do you load your
                            files to be able to "see" the high byte?

                            Unless I can come up with a better way I suppose this would work:

                            Load the file,
                            Capture the full file name
                            Copy all to the clipboard,
                            Close the file,
                            Paste to a new doc
                            Save the doc to the captured file name

                            but it seems so inelegant and inefficient.
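
Outside NoteTab, the same round trip can be sketched in a few lines of Python. This is a sketch under assumptions, not a NoteTab feature: `utf16_to_ansi` is a hypothetical name, "ANSI" is taken to mean Windows code page 1252, input without a byte order mark is assumed to be little-endian, and characters with no ANSI equivalent become numeric entities rather than being dropped.

```python
import codecs


def utf16_to_ansi(data):
    """Convert two-byte Unicode (UTF-16) bytes to single-byte CP1252 bytes."""
    has_bom = data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    # With a BOM the "utf-16" codec picks the byte order and strips the BOM;
    # without one, little-endian is assumed here.
    text = data.decode("utf-16" if has_bom else "utf-16-le")
    # Unmappable characters are written as &#NNNN; instead of being lost.
    return text.encode("cp1252", errors="xmlcharrefreplace")
```

Reading the source file in binary mode, passing the bytes through this function and writing the result back in binary mode replaces the six manual steps.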

                            Art
