
Dealing with UTF

  • Axel Berger
    Message 1 of 10 , Jul 24, 2011
      Having been silly enough to volunteer, I shall probably have to deal
      with some UTF-8 pages in the near future. As they are apt to contain
      German umlauts, French accents and Cyrillic letters, possibly some Greek
      too, NoteTab's UTF mode won't be any use to me, as it would always
      destroy the pages. This is not too bad, I'll mostly deal with HTML tags,
      not existing content, so opening as "UTF-8 no conversion" will do me
      fine.

      I may do some editing too and then I shall need two things. The first is
      probably simple. I will need to convert typed umlauts to UTF. This could
      be "Ä"
      ^!Find "Ä(?![€-¿])" RASTI
      When not followed by any of those, it can't be part of a valid
      UTF-sequence. The same for "ä"
      ^!Find "ä(?![€-¿]{2})" RASTI
      So I can easily make replaces for the typical accents and umlauts.
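For readers who want to try the idea outside NoteTab, here is a minimal Python sketch working on raw bytes. It is not the clip itself: the function name `fix_typed_umlauts` and the character list `TYPED` are assumptions made for illustration. A Latin-1 byte such as 0xC4 (`Ä`) is treated as a typed character only when it is not followed by the number of continuation bytes (0x80-0xBF) that a UTF-8 lead byte of that value would require.

```python
import re

# Characters that might be typed in Latin-1 (illustrative, not exhaustive).
# Assumption: every entry lies in U+00C0..U+00FF, i.e. is a possible lead byte.
TYPED = "ÄÖÜäöüßéèà"

def fix_typed_umlauts(data: bytes) -> bytes:
    """Replace stray Latin-1 letters with their UTF-8 encoding."""
    for ch in TYPED:
        b = ch.encode("latin-1")              # one byte, e.g. b"\xc4"
        lead = b[0]
        if lead >= 0xF5:
            guard = b""                       # never a legal UTF-8 lead byte
        else:
            # 0xC0-0xDF: 1 continuation byte, 0xE0-0xEF: 2, 0xF0-0xF4: 3
            n = 1 if lead < 0xE0 else 2 if lead < 0xF0 else 3
            guard = rb"(?![\x80-\xBF]{%d})" % n
        data = re.sub(re.escape(b) + guard, ch.encode("utf-8"), data)
    return data
```

A typed `Ä` directly followed by another typed high character in the range 0x80-0xBF would be mistaken for an existing UTF-8 sequence, so a manual check of the results is still sensible.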

      My second problem is harder. I will want to test for illegal characters.
      Legal UTF-8 can be defined as
      ^!Find "([\x01-\x7F]|[À-ß][€-¿]|[à-ÿ][€-¿]{2})" RASTI
      So illegal UTF-8 is the negative of that. Apart from reading characters
      one by one and processing through loads of ^!If's, is there an easy way
      to ^!Find an illegal character?

      Danke
      Axel
    • Alec Burgess
      Message 2 of 10 , Jul 24, 2011
        On 2011-07-24 17:59, Axel Berger wrote:
        > My second problem is harder. I will want to test for illegal characters.
        > Legal UTF-8 can be defined as
        > ^!Find "([\x01-\x7F]|[À-ß][€-¿]|[à-ÿ][€-¿]{2})" RASTI
        > So illegal UTF-8 is the negative of that. Apart from reading characters
        > one by one and processing through loads of ^!If's, is there an easy way
        > to ^!Find an illegal character?
        I'm not sure what you need to do with a specific illegal once found?

        Assuming non-legal characters are somewhat rare (?) could you:
        (1) make a copy of the full text
        (2) change your ^!Find above to a ^!Replace which converts ALL legal
        UTF-8 characters to the same character.
        (3) ^!Find every character not matching (hence an illegal character) in
        a loop
        (4) Calculate the position of each and do whatever is required to the
        matching character at the same position in the original buffer
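The four steps above can be sketched in Python on raw bytes. This is only an illustration of the masking idea, not a NoteTab clip: the names `LEGAL` and `illegal_offsets` are hypothetical, and the sketch pads each legal sequence with as many filler bytes as it is long, so that offsets in the masked copy equal offsets in the original and no position arithmetic is needed.

```python
import re

# Well-formed UTF-8 sequences, one alternative per sequence length and range.
LEGAL = re.compile(rb"""
      [\x00-\x7F]                          # ASCII
    | [\xC2-\xDF][\x80-\xBF]               # 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, plane 16
""", re.VERBOSE)

def illegal_offsets(data: bytes) -> list:
    """Return the offsets of bytes that are not part of a legal sequence."""
    # Steps 1 and 2: work on a copy in which every legal sequence becomes dots.
    masked = LEGAL.sub(lambda m: b"." * len(m.group()), data)
    # Steps 3 and 4: whatever is not a dot was never matched, hence illegal.
    return [i for i, byte in enumerate(masked) if byte != ord(".")]
```

Because the substitution consumes each legal sequence as a whole, the continuation bytes of a valid character are never examined on their own.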

        I've got a hunch (but haven't found the appropriate $Str....$ functions)
        that, having done (2) above, you might be able to get results by changing
        either original and converted buffers to string(s?) and doing string
        functions on them ....

        Regards ... Alec (buralex@gmail& WinLiveMess - alec.m.burgess@skype)
      • John Shotsky
        Message 3 of 10 , Jul 24, 2011
          You can include most of the legal characters within ranges within a negative class. Others can be added individually.
          Then, you can run a clip to locate any character that is not included in the negative classes.

          May I suggest BabelMap, the best character map I have found. Quite easy to determine the ranges by inspection.

          Regards,
          John

        • flo.gehrke
          Message 4 of 10 , Jul 24, 2011
            --- In ntb-clips@yahoogroups.com, Axel Berger <Axel-Berger@...> wrote:
            >
            > My second problem is harder. I will want to test for illegal
            > characters. Legal UTF-8 can be defined as
            > ^!Find "([\x01-\x7F]|[À-ß][€-¿]|[à-ÿ][€-¿]{2})" RASTI
            > So illegal UTF-8 is the negative of that. Apart from
            > reading characters one by one and processing through
            > loads of ^!If's, is there an easy way to ^!Find an
            > illegal character?

            Axel,

            To be frank, I'm not very familiar with encoding. But I found a "UTF-8-ness checker using a regular expression" at...

            http://www.php.net/manual/de/function.mb-detect-encoding.php#50087

            It's written for PHP. So I took that RegEx and transcribed it to the PCRE used with NoteTab. To make it more readable, I inserted some spaces (using the '(?x)' modifier) and took over the comments from that PHP version. Since it finds legal UTF-8, I turned it into a negation by using a Negative Lookahead. It says: Find any character at a position where you DO NOT see a legal UTF-8 when looking ahead. So this '^!Find' could possibly match any illegal UTF-8:

            ^!Find "(?x) (?!(?:(?#ASCII)[\x09\x0A\x0D\x20-\x7E] | (?#non-overlong 2-byte)[\xC2-\xDF][\x80-\xBF] | (?#excluding overlongs)\xE0[\xA0-\xBF][\x80-\xBF] | (?#straight 3-byte)[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | (?#excluding surrogates)\xED[\x80-\x9F][\x80-\xBF] | (?#planes 1-3)\xF0[\x90-\xBF][\x80-\xBF]{2} | (?#planes 4-15)[\xF1-\xF3][\x80-\xBF]{3} | (?#plane 16)\xF4[\x80-\x8F][\x80-\xBF]{2}))." RS


            Regards,
            Flo
          • Axel Berger
            Message 5 of 10 , Jul 25, 2011
              Thanks Alec, John and Flo,

              "flo.gehrke" wrote:
              > It says: Find any character at a position where you DO NOT
              > see a legal UTF-8 when looking ahead. So this '^!Find' could
              > possibly match any illegal UTF-8:

              That's an interesting idea and your sample is more precise than my crude
              solution, though I have yet to understand some of the differences. But:
              I have some character followed by a legal sequence. I next find the first
              byte of that sequence followed by the truncated rest, which is not a legal
              sequence and get a false positive.
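The false positive can be reproduced with a small Python sketch that applies the same negative lookahead to raw bytes. Python's `re` module stands in for NoteTab's PCRE here, and the names `NOT_LEGAL_HERE` and `lookahead_hits` are made up for this illustration.

```python
import re

# Match any single byte at a position where no legal UTF-8 sequence starts.
NOT_LEGAL_HERE = re.compile(rb"""
    (?! (?: [\x09\x0A\x0D\x20-\x7E]            # ASCII
          | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
          | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
          | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
          | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
          | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        ) )
    .
""", re.VERBOSE | re.DOTALL)

def lookahead_hits(data: bytes) -> list:
    """Offsets at which the lookahead search reports an 'illegal' byte."""
    return [m.start() for m in NOT_LEGAL_HERE.finditer(data)]

# Perfectly valid UTF-8: a (61), Ä (C3 84), € (E2 82 AC).
valid = "aÄ€".encode("utf-8")
hits = lookahead_hits(valid)   # offsets 2, 4 and 5: continuation bytes only
```

The search does not match at a lead byte, because a legal sequence starts there. It then advances by a single byte, lands on a continuation byte, and reports it, although the text contains no error at all.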

              John, the problem is that a character may be illegal on its own but legal
              in a specific position of a two or three character UTF-sequence. My two
              examples for Ä demonstrate this.

              Alec, illegal characters should be rare or nonexistent if I make no
              mistakes. I end several of my conversion clips with a find that
              highlights all possible problem cases for manual inspection. So this
              search must not destroy anything and must enable me to make manual
              corrections where needed. I need not be very precise, I typically
              inspect five to ten places for one needed correction.

              Axel
            • Axel Berger
              Message 6 of 10 , Jul 25, 2011
                "flo.gehrke" wrote:
                > ^!Find "(?x) (?!(?:(?#ASCII)[\x09\x0A\x0D\x20-\x7E] | (?#non-overlong 2-byte)[\xC2-\xDF][\x80-\xBF] | (?#excluding overlongs)\xE0[\xA0-\xBF][\x80-\xBF] | (?#straight 3-byte)[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | (?#excluding surrogates)\xED[\x80-\x9F][\x80-\xBF] | (?#planes 1-3)\xF0[\x90-\xBF][\x80-\xBF]{2} | (?#planes 4-15)[\xF1-\xF3][\x80-\xBF]{3} | (?#plane 16)\xF4[\x80-\x8F][\x80-\xBF]{2}))." RS

                Flo,

                two questions:
                Your sequence seems to be "(?!<a lot>)." Shouldn't the dot come first
                ".(?!<a lot>)"?
                And should not DOTALL be asserted first?
                It seems I also need to invoke UTF mode by (*UTF8).

                Do you agree?

                Danke
                Axel
              • flo.gehrke
                Message 7 of 10 , Jul 25, 2011
                  --- In ntb-clips@yahoogroups.com, Axel Berger <Axel-Berger@...> wrote:
                  >
                  > Flo, two questions:...

                  Axel,

                  > Your sequence seems to be "(?!<a lot>)." Shouldn't the dot come
                  > first ".(?!<a lot>)"?

                  No! You can test that concept as follows:

                  '(?!A).' will match any upper case letter that is no 'A' (case-sensitive search). Whereas '.(?!A)' will match any character (including 'A' itself) that is not followed by an 'A'.

                  The logic behind this: The Negative Lookahead is an assertion, i.e. it doesn't consume any character. Thus the '.' and the 'no-A' are matching the same position.
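The difference between the two orderings is easy to check in any regex engine with lookahead. The following Python sketch uses the word "BANANA" as an arbitrary test string; the function names are hypothetical.

```python
import re

def not_an_a(text: str) -> list:
    """Offsets matched by '(?!A).': the character itself is not 'A'."""
    return [m.start() for m in re.finditer(r"(?!A).", text)]

def not_before_a(text: str) -> list:
    """Offsets matched by '.(?!A)': the NEXT character is not 'A'."""
    return [m.start() for m in re.finditer(r".(?!A)", text)]

same_position = not_an_a("BANANA")      # B, N, N at offsets 0, 2, 4
next_position = not_before_a("BANANA")  # the three A's at offsets 1, 3, 5
```

In the first pattern the assertion and the dot look at the same character. In the second the dot has already consumed one character before the assertion is tested.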

                  > And should not DOTALL be asserted first?

                  DOTALL means that also a NL will be matched. I'm not quite sure if this is necessary. Sorry, I have to leave that question to encoding experts. This also pertains to...

                  > It seems I also need to invoke UTF mode by (*UTF8).

                  Maybe a test will show?

                  Regards,
                  Flo
                • flo.gehrke
                  Message 8 of 10 , Jul 25, 2011
                    --- In ntb-clips@yahoogroups.com, "flo.gehrke" <flo.gehrke@...> wrote:
                    >
                    > '(?!A).' will match any upper case letter that is no 'A' (case-sensitive search)...

                    Oops, sorry, that was too fast ;-)

                    It matches ANY character that is no upper-case 'A' (case-sensitive search) including lower-case 'a' of course.

                    Flo
                  • Axel Berger
                    Message 9 of 10 , Jul 25, 2011
                      "flo.gehrke" wrote:
                      > The logic behind this: The Negative Lookahead is an
                      > assertion, i.e. it doesn't consume any character. Thus the
                      > '.' and the 'no-A' are matching the same position.

                      Got it, thanks.

                      > > It seems I also need to invoke UTF mode by (*UTF8).
                      > Maybe a test will show?

                      It did. I can't get it to work. With UTF mode asserted, the whole sequence
                      is seen as one character, thus it never matches any of the multi-byte sequences.
                      Not asserting UTF will check the first byte of a sequence, the truncated
                      rest of which is illegal.

                      Either way the search stops at every UTF occurrence. Seems it'll have to
                      be the loop after all, or no test before the upload and use an external
                      validator.
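As one possible form of an external check before the upload, Python's built-in strict UTF-8 decoder reports where decoding first fails. This is a sketch of that swapped-in technique, not something used in the thread; the function name `first_invalid_offset` is an assumption.

```python
def first_invalid_offset(data: bytes):
    """Return the byte offset of the first UTF-8 error, or None if valid."""
    try:
        data.decode("utf-8")          # strict decoding is the default
    except UnicodeDecodeError as err:
        return err.start              # index of the first offending byte
    return None
```

Reading the file in binary mode (for example with `open(path, "rb")`) and passing its contents to this function gives a quick yes/no answer plus a position to inspect by hand.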

                      Danke
                      Axel
                    • Axel Berger
                      Message 10 of 10 , Jul 25, 2011
                        Axel Berger wrote:
                        > Seems it'll have to be the loop after all,

                        The following may not be the best solution, but it's the only one I
                        could get to work reliably:

                        :loop
                        ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
                        ^!IfError usasc
                        ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
                        ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
                        ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
                        ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
                        ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
                        ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
                        ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
                        ^!Continue no match
                        ^!Goto loop
                        :usasc
                        ^!Continue No errors found
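The same two-stage logic can be sketched in Python for comparison: first grab each run that starts with a high byte, then test the run against the legal multi-byte patterns. The names `CANDIDATE`, `LEGAL_MULTIBYTE` and `first_illegal` are hypothetical, and the sketch uses `fullmatch`, which requires the whole run to be one legal sequence. That may be stricter than what ^!IfMatch does in the clip.

```python
import re

# Stage 1: a stray continuation byte, or a lead byte plus its followers.
CANDIDATE = re.compile(rb"[\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*")

# Stage 2: the legal multi-byte sequences tested by the ^!IfMatch lines.
LEGAL_MULTIBYTE = re.compile(rb"""
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
""", re.VERBOSE)

def first_illegal(data: bytes):
    """Return (offset, bytes) of the first suspicious run, or None."""
    for m in CANDIDATE.finditer(data):
        if not LEGAL_MULTIBYTE.fullmatch(m.group()):
            return m.start(), m.group()
    return None
```

Since the candidate pattern is greedy, a legal character directly followed by a stray continuation byte is reported as one run starting at the legal lead byte, which is close enough for manual inspection.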

                        With no errors being the normal case and illegal characters hopefully
                        rare I can live with sometimes having to start the clip several times.

                        Thanks for all the help.

                        Axel