RE: [Clip] Trouble with UTF

  • John Shotsky
    Message 1 of 12 , Jan 8, 2012
      I think if you copy your modified doc, paste it into a new doc and save that, all will be fine (assuming you have new docs
      set for DOS ASCII). Save and Save As seem to have some residual memory of what the file used to be: NoteTab knows what
      format it was opened as and, generally speaking, will save it in that format. I have built that right into my clip
      handling, so I don't think about it anymore. You can get some strange characters from email or the web.

      Regards,
      John
      RecipeTools Web Site: http://recipetools.gotdns.com/

      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of Axel Berger
      Sent: Sunday, January 08, 2012 12:45
      To: NoteTab Clips
      Subject: [Clip] Trouble with UTF


      I'm having serious trouble with the so-called UTF-8 capabilities of
      NoteTab, so bad, I might even have to go back to version 5.8.

      I've got a working clip to convert all legal UTF-8 to either ANSI or
      entities like &#8211;. Trouble is, unless I jump through hoops it can't
      work, as files are loaded read-only.
      N.B: Whenever that happens I want a really big alert box! All too often
      have I juggled with a clip that just would not work and not seen that
      tiny "read only" at the bottom of the window.

      After conversion and after saving and reloading things still won't work.
      The file is now clean and pure ANSI but somewhere somehow NoteTab seems
      to remember it used to be UTF-8. I have to tell NoteTab explicitly to
      load it as ANSI for stuff to work again. With old, proven and reliable
      clips I don't usually check but just move on, only to find much later,
      that nothing was done.

      This is just not good enough. I really need to be able to turn those
      dysfunctional UTF capabilities off.

      Axel

      --
      Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
      Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
      D-51519 Odenthal-Heide eMail: Axel-Berger@...
      Deutschland (Germany) http://berger-odenthal.de



    • Art Kocsis
      Message 2 of 12 , Jan 8, 2012
        Axel,

        A couple of suggestions:

        You could create your own alert box.
        Insert a test line at the beginning of the document, then check
        for an error condition or test for the line's existence, and pop up an alert.

        Using the result of the above test, toggle the read-only flag for the
        document: ^!Menu/Document/Read-Only.
        Unfortunately, it is a toggle, so you have to test first.

        Instead of opening the Unicode document, load its ANSI text into
        a new tab with ^$GetUnicodeFileText("FileName")$.

        However, this may not work for you, as it may destroy the non-ANSI
        characters that you converted to the &#8211; form.

        The Unicode problem is a big PITA as more and more docs appear
        in Unicode. NTB really needs a built-in conversion command but I
        am not holding my breath. Updates are few and far between and then
        only minor.

        Art


        At 1/8/2012 12:45 PM, Axel wrote:
        >I'm having serious trouble with the so-called UTF-8 capabilities of
        >NoteTab, so bad, I might even have to go back to version 5.8.
        >
        >I've got a working clip to convert all legal UTF-8 to either ANSI or
        >entities like &#8211;. Trouble is, unless I jump through hoops it can't
        >work, as files are loaded read-only.
        >N.B: Whenever that happens I want a really big alert box! All too often
        >have I juggled with a clip that just would not work and not seen that
        >tiny "read only" at the bottom of the window.
      • bruce.somers@web.de
        Message 3 of 12 , Jan 8, 2012
          > The Unicode problem is a big PITA as more and more docs appear in Unicode. NTB really needs a built-in conversion command but I
          > am not holding my breath. Updates are few and far between and then only minor.

          Is NoteTab now orphaned software? Is it still being maintained? Sad, if it is not.

          Bruce
        • Axel Berger
          Message 4 of 12 , Jan 9, 2012
            Art Kocsis wrote:
            > ^!Menu/Document/Read-Only.

            Thanks! I had looked in vain for just that all over the place and failed
            to find it. Unable to turn read-only off, I had to resort to
            complicated gymnastics using Load As. This will really make life easier.
            Why, oh why, is it not in the document properties box?

            Your other two hints are impractical though. The thing is, I'm surprised
            by the read-only where I did not expect it, as when saving a whole web
            page and opening it through a double-click.

            > NTB really needs a built-in conversion command

            My simple clip suffices. As long as NoteTab can't really work with
            characters of more than one byte and doesn't offer full UTF capability,
            it shouldn't pretend to do so. UTF opened as is may look strange but can
            be worked on, which is the only important thing.

            Axel

            --
            Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
            Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
            D-51519 Odenthal-Heide eMail: Axel-Berger@...
            Deutschland (Germany) http://berger-odenthal.de
          • loro
            Message 5 of 12 , Jan 11, 2012
              At 09:28 2012-01-09, Axel Berger wrote:
              >Your other two hints are impractical though. The thing is, I'm surprised
              >by the read-only where I did not expect it, as when saving a whole web
              >page and opening it through a doubleclick.

              You can turn it off in Options, the General tab

              -----
              Protect Unicode Files: When NoteTab opens Unicode files, it has to
              convert them to the ANSI character set, which handles much fewer
              characters. As a result, the conversion process may drop non-ANSI and
              cause the loss of information. When this setting is checked, Unicode
              files are opened in Read-Only mode, which protects the file from
              changes. If you know that your Unicode documents will not loose
              important information during the conversion process, you can uncheck
              this option in order to open such files in editable mode
              -----

              Lotta
            • Axel Berger
              Message 6 of 12 , Jan 11, 2012
                loro wrote:
                > You can turn it off in Options, the General tab
                > Protect Unicode Files:

                Thanks Loro, I had of course looked there but overlooked that (to me)
                cryptic option. But what about

                > As a result, the conversion process may drop non-ANSI and
                > cause the loss of information.

                I want no conversion at all. I want to open files as-is, as the "UTF-8 (no
                conversion)" setting in the Open dialog would do, and do all conversions
                myself. At the very least I do not want to simply lose something.

                Danke
                Axel

                P.S: I still want a warning message. Sometimes I have set files on disk
                to read-only myself and don't think about it, and I hate it when
                editing just refuses to work. Other programs allow full editing and only
                remap "save" to "save as" in these cases. A much better solution IMHO.

                --
                Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
                Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
                D-51519 Odenthal-Heide eMail: Axel-Berger@...
                Deutschland (Germany) http://berger-odenthal.de
              • Art Kocsis
                Message 7 of 12 , Jan 11, 2012
                  The first thing I thought of is that there is a big difference between
                  turning the protect option off and loading a document as Unicode and then
                  turning read-only off. Even after editing a Unicode doc and saving it,
                  the Unicode characters are still intact, but they would be lost if the doc
                  were loaded as ANSI.

                  BTW, Axel, will your clip be available after you get finished tweaking it?

                  Art


                  At 1/11/2012 11:20 AM, Axel wrote:
                  >loro wrote:
                  > > You can turn it off in Options, the General tab
                  > > Protect Unicode Files:
                  >
                  >Thanks Loro, I had of course looked there but overlooked that (to me)
                  >cryptic option. But what about
                  >
                  > > As a result, the conversion process may drop non-ANSI and
                  > > cause the loss of information.

                  <snip>

                  >Axel
                  >
                  >P.S: I still want a warning message. Sometimes I have set files on disk
                  >to read-only myself and don't think about it, and i hate it, when
                  >editing just refuses to work. Other programs allow full editing and only
                  >remap "save" to "save as" in these cases. A much better solution IMHO.
                • Axel Berger
                  Message 8 of 12 , Jan 11, 2012
                    Art Kocsis wrote:
                    > will your clip be available after you get finished tweaking it?

                    There was a very silly typing mistake in the first draft that didn't
                    come to light until I first came upon non-ANSI characters. This seems to
                    work (many ^!Set lines are long):

                    :loop
                    ^!Find "[\xC0-\xF7][\x80-\xBF]*" RS
                    ^!IfError donelatin
                    ^!IfMatch "[\xC2-\xC3][\x80-\xBF]" "^$GetSelection$" latin1
                    ^!IfMatch "[\xC0-\xDF][\x80-\xBF]" "^$GetSelection$" zwei
                    ^!IfMatch "[\xE0-\xEF][\x80-\xBF]{2}" "^$GetSelection$" drei
                    ^!IfMatch "[\xF0-\xF7][\x80-\xBF]{3}" "^$GetSelection$" vier
                    ^!Continue Illegal sequence, can't be converted.
                    ^!Goto loop
                    :zwei
                    ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                    ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 32)$
                    ^!Set %third%=0
                    ^!Set %fourth%=0
                    ^!Goto makeent
                    :drei
                    ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
                    ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                    ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 16)$
                    ^!Set %fourth%=0
                    ^!Goto makeent
                    :vier
                    ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";4)$)$ MOD 64)$
                    ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD 64)$
                    ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD 64)$
                    ^!Set %fourth%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD 8)$
                    :makeent
                    ^!Set %first%=^$Calc(262144*^%fourth%+4096*^%third%+64*^%second%+^%first%;0)$
                    ^!InsertText &#^%first%;
                    ^!Goto loop
                    :latin1
                    ^!Set %first%=^$StrCopyRight("^$GetSelection$";1)$
                    ^!Set %second%=^$StrCopyLeft("^$GetSelection$";1)$
                    ^!Set %first%=^$Calc(^$CharToDec(^%first%)$ MOD 64)$
                    ^!Set %second%=^$Calc(^$CharToDec(^%second%)$ MOD 4)$
                    ^!InsertText ^$DecToChar(^$Calc(64*^%second%+^%first%)$)$
                    ^!Goto loop
                    :donelatin
                    ^!Replace "&#8364;" >> "€" WASTI
                    ^!Replace "&#352;" >> "Š" WASTI
                    ^!Replace "&#353;" >> "š" WASTI
                    ^!Replace "&#381;" >> "Ž" WASTI
                    ^!Replace "&#382;" >> "ž" WASTI
                    ^!Replace "&#338;" >> "Œ" WASTI
                    ^!Replace "&#339;" >> "œ" WASTI
                    ^!Replace "&#376;" >> "Ÿ" WASTI
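
For anyone who wants to check the clip's arithmetic outside NoteTab, here is a minimal Python sketch of the same idea. It is not part of the clip: the names `code_point` and `to_ansi_or_entities` are made up for illustration, and the second function relies on Python's built-in `xmlcharrefreplace` error handler instead of reproducing the clip step by step. Like the latin1 branch above, it keeps only characters up to U+00FF as single bytes; it does not perform the trailing ^!Replace step for the euro sign and friends.

```python
def code_point(seq: bytes) -> int:
    """Rebuild a code point from ONE multi-byte UTF-8 sequence.

    Mirrors the clip's MOD arithmetic: the lead byte contributes
    5, 4 or 3 payload bits (MOD 32, 16 or 8) and every continuation
    byte contributes 6 bits (MOD 64).
    """
    lead_mod = {2: 32, 3: 16, 4: 8}[len(seq)]  # KeyError for other lengths
    value = seq[0] % lead_mod
    for byte in seq[1:]:
        value = value * 64 + byte % 64
    return value


def to_ansi_or_entities(raw: bytes) -> bytes:
    """Decode UTF-8, keep U+0000..U+00FF as single bytes, and write
    everything else as a decimal entity such as &#8211;.

    Raises UnicodeDecodeError if the input is not legal UTF-8.
    """
    text = raw.decode("utf-8")
    return text.encode("latin-1", errors="xmlcharrefreplace")
```

With this sketch, the three bytes E2 80 93 give `code_point(b"\xe2\x80\x93") == 8211`, which is the en dash that ends up as `&#8211;`.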

                    It is advisable to check for legal UTF-8, i.e. no non-UTF 8-bit
                    characters, first:

                    :loop
                    ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
                    ^!IfError usasc
                    ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
                    ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
                    ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
                    ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
                    ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
                    ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
                    ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
                    ^!Continue Illegal sequence, no UTF-8
                    ^!Goto loop
                    :usasc
                    ^!Continue No errors found
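
The same check can be sketched in Python, assuming the file has been read as raw bytes. The function name `first_utf8_error` is hypothetical; the sketch simply lets Python's strict UTF-8 decoder do the work, which rejects overlong forms, surrogates and stray continuation bytes in much the same spirit as the patterns above.

```python
def first_utf8_error(raw: bytes):
    """Return the byte offset of the first illegal UTF-8 sequence,
    or None if the whole input is legal UTF-8 (plain ASCII included)."""
    try:
        raw.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError as err:
        return err.start     # offset of the offending byte
    return None
```

A file could be checked with `first_utf8_error(open(path, "rb").read())`, where `path` is whatever file you want to test.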

                    Neither clip starts with a ^!Jump TEXT_START; both begin at the
                    current cursor position. This is on purpose, but you might want to
                    change it.

                    Axel

                    --
                    Dipl.-Ing. F. Axel Berger Tel: +49/ 2174/ 7439 07
                    Johann-Häck-Str. 14 Fax: +49/ 2174/ 7439 68
                    D-51519 Odenthal-Heide eMail: Axel-Berger@...
                    Deutschland (Germany) http://berger-odenthal.de
                  • Art Kocsis
                    Message 9 of 12 , Jan 12, 2012
                      Axel,

                      Thanks for the clip. It will be interesting to analyze it in detail, which I
                      will have time for later.
                      After a quick scan it looks like you still have a Unicode document, i.e.,
                      two bytes per character. For my Unicode source docs, such as Windows
                      registry exports or web text, I just want to map into the ANSI character
                      set and delete the upper byte. Setting NTB's option to not protect Unicode
                      (as loro suggested) will probably work for me.

                      Just in case you are not aware of it, Andrew West's Babelstone site
                      [http://www.babelstone.co.uk/index.html] has an online and a freeware
                      mapper for all the 110,116 (and counting!) Unicode chars, a freeware
                      Unicode editor, a three hour(!) slide show of all the glyphs as well as
                      LOTS of other interesting goodies. One could spend hours reading just one
                      of his blogs, such as "Mani Stones in Many Scripts"
                      [http://babelstone.blogspot.com/2006/11/mani-stones-in-many-scripts.html].
                      The detail, depth and extent of the info presented is captivating and
                      amazing (especially to anyone who has ever chanted the Sanskrit mantra: om
                      mani padme hum).

                      I also found it interesting to see your bilingual roots showing in your
                      code as evidenced by the mixture of German and English labels. It works
                      well. When I was in Germany a few years ago some of my (very poorly
                      remembered) linguistic training came out as well. I found myself counting
                      eins, zwei, drei, cuatro, cinco, seis! My HS & college teachers would be
                      proud<g>. (Or turning over in their graves!)

                      Art


                      At 1/11/2012 11:08 PM, Axel wrote:
                      >Art Kocsis wrote:
                      > > will your clip be available after you get finished tweaking it?
                      >
                      >There was a very silly typing mistake in the first draft that didn't
                      >come to light until I first came upon non-ANSI characters. This seems to
                      >work (many ^!Set lines are long):
                      <snip>
                      >Axel
                    • Axel Berger
                      Message 10 of 12 , Jan 13, 2012
                        Art Kocsis wrote:
                        > After a quick scan it looks like you still have a Unicode
                        > document, i.e., two bytes per character.

                        No. I map everything to ANSI that can be converted (label latin1),
                        which is fewer than 128 characters. All the rest can't be mapped, so I
                        convert all that to HTML entities like &#8211;

                        Axel
                      • Art Kocsis
                        Message 11 of 12 , Jan 13, 2012
                          OK, after a longer scan I see better what you are doing.
                          However, my question/statement still stands.

                          Just to make sure we are on the same page I am defining
                          a Unicode document as one that uses two bytes to encode
                          each character. Most of the time (for western documents),
                          the high byte is zero.

                          When I load, edit and save a Unicode document, it retains
                          the dual byte format. The mapping is 100% to the ANSI
                          character set but it still uses two bytes per character. This
                          holds whether I turn Unicode protection off or if I leave it on
                          and manually remove the read-only flag. Have you looked at
                          your files with a binary viewer to verify a single or dual byte
                          encoding format?
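
If no binary viewer is at hand, a few lines of Python can answer the single-versus-dual-byte question. This is only a sketch: `hex_preview` and `guess_encoding` are made-up names, the BOM constants come from Python's standard `codecs` module, and the no-BOM branch is a rough heuristic (every second byte zero) that fits western text only. `bytes.hex` with a separator needs Python 3.8 or later.

```python
import codecs


def hex_preview(raw: bytes, count: int = 16) -> str:
    """Show the first `count` bytes as space-separated hex pairs."""
    return raw[:count].hex(" ")


def guess_encoding(raw: bytes) -> str:
    """Very rough guess based on the byte order mark or zero high bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le with BOM"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be with BOM"
    sample = raw[:64]
    # Heuristic: western text stored as UTF-16LE has a zero in every odd position.
    if len(sample) >= 4 and all(b == 0 for b in sample[1::2]):
        return "utf-16-le without BOM (guessed)"
    return "single-byte or utf-8 without BOM"
```

Reading the file with `open(path, "rb").read()` and passing the result to both functions shows at a glance whether the zero high bytes are really on disk.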

                          Your clip maps characters with non-zero high bytes to the
                          127 character Latin-1 set but still (I think) retains the dual
                          byte encoding. Since the high bytes in my files are already
                          all zeros, what I want to do is just delete the high byte.

                          Since turning Unicode protection off didn't help me, I thought
                          modifying your clip code would do the trick: simply replace
                          every dual-byte character with its low byte contents. But it doesn't
                          work.

                          FIND reported zero matches for all variations of "\x00[\x00-\xff]",
                          "[\x00][\x00-\xff]" and "[\x00-\x00][\x00-\xff]" yet the source
                          file strictly consisted of alternating zero and non-zero bytes.
                          What are your NTB option settings and how do you load your
                          files to be able to "see" the high byte?

                          Unless I can come up with a better way I suppose this would work:

                          Load the file,
                          Capture the full file name,
                          Copy all to the clipboard,
                          Close the file,
                          Paste to a new doc,
                          Save the doc to the captured file name

                          but it seems so inelegant and inefficient.
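
Outside NoteTab, dropping the high bytes can be sketched in a few lines of Python. The function name `utf16_to_ansi` is hypothetical, and two assumptions are baked in: a file without a byte order mark is treated as little-endian (the usual Windows case), and "ANSI" is taken to mean Windows-1252. The encode step is strict, so a character with no ANSI equivalent raises an error instead of being silently lost.

```python
import codecs


def utf16_to_ansi(raw: bytes) -> bytes:
    """Convert two-byte-per-character UTF-16 data to single-byte Windows-1252."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[2:].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[2:].decode("utf-16-be")
    else:
        # Assumption: no BOM means little-endian.
        text = raw.decode("utf-16-le")
    # Strict encoding: raises UnicodeEncodeError rather than dropping characters.
    return text.encode("cp1252")
```

To convert a file, read it with `open(src, "rb")`, pass the bytes through `utf16_to_ansi`, and write the result with `open(dst, "wb")`; `src` and `dst` are placeholders for your own file names.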

                          Art

                          At 1/13/2012 01:44 AM, Axel wrote:
                          >Art Kocsis wrote:
                          > > After a quick scan it looks like you still have a Unicode
                          > > document, i.e., two bytes per character.
                          >
                          >No. I map everything to ANSI than can be converted (label latin-1),
                          >which is less than 128 characters. All the rest can't be mapped so I
                          >convert all that to HTML entities like &#8211;