Using U+XXXX notation

  • John Shotsky
    Message 1 of 3 , Jun 4, 2014

      How does one use the above format when converting Unicode characters to ANSI characters?

      An example of converting the 1/3 fraction character to three characters is enough to show me the needed format. I wish the NT help had actual examples of commands, so one could simply adapt them to needed use instead of guessing how to create the proper syntax.

      I'm using this:

      ^!Replace "(*UTF8)\x{2153}" >> "1/3" ARSW0

      I'd like to use:

      U+2153, but don't know how to write it.

      Your test file will have to be saved with a Unicode-capable editor such as Windows Notepad, not NoteTab, which will convert the character to a question mark.

      Here is an example character which will only be visible in html email, I suspect.

      Regards,
      John


    • flo.gehrke
      Message 2 of 3 , Jun 4, 2014
        ---In ntb-clips@yahoogroups.com, <jshotsky@...> wrote :

        > I'm using this:

        > ^!Replace "(*UTF8)\x{2153}" >> "1/3" ARSW0

        > I'd like to use:

        > U+2153, but don't know how to write it.


        I think '(*UTF8)\x{2153}' is the only legal notation in PCRE.

        So for searching 'U+2153', probably, the only work-around is to convert the search string to that notation with something like...

        ^!Set %Str%=U+2153
        ^!Set %Str%=^$StrReplace("U+";"\x{";"^%Str%";I)$
        ^!Set %Str%=^$StrReplace("[0-9A-Fa-f]$";"$0}";"^%Str%";R)$
        ^!Info Search string: ^%Str%

        I didn't test it -- but probably this should work now...

        ^!Replace "(*UTF8)^%Str%" >> "1/3" WARS
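
For anyone who wants to check the conversion outside NoteTab, here is a rough Python sketch of the same idea: take a 'U+XXXX' string and build the '\x{XXXX}' escape from it. This is not NoteTab Clip code, and the function names (parse_u_plus, u_plus_to_pcre, u_plus_to_char) are made up for illustration. It accepts any hex digit, so code points ending in A-F (such as U+200A) get their closing brace as well.

```python
import re

# Accepts "U+" followed by 4 to 6 hex digits, e.g. "U+2153" or "U+200A".
U_PLUS = re.compile(r"[Uu]\+([0-9A-Fa-f]{4,6})")


def parse_u_plus(notation: str) -> str:
    """Return the hex digits of a U+XXXX string, upper-cased."""
    match = U_PLUS.fullmatch(notation.strip())
    if match is None:
        raise ValueError(f"not a U+XXXX code point: {notation!r}")
    return match.group(1).upper()


def u_plus_to_pcre(notation: str) -> str:
    """Turn 'U+2153' into the PCRE-style escape '\\x{2153}'."""
    return "\\x{" + parse_u_plus(notation) + "}"


def u_plus_to_char(notation: str) -> str:
    """Turn 'U+2153' into the character it names."""
    return chr(int(parse_u_plus(notation), 16))


# Build the search string, then do the replacement on a sample line.
search_string = u_plus_to_pcre("U+2153")
sample = "Add " + u_plus_to_char("U+2153") + " cup of sugar"
cleaned = sample.replace(u_plus_to_char("U+2153"), "1/3")
print(search_string)
print(cleaned)
```

The string in search_string is what would go after (*UTF8) in the ^!Replace line above.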

        Regards,
        Flo

        P.S. Somehow, this reminds me of a topic of July 6, 2009...

        https://groups.yahoo.com/neo/groups/ntb-clips/conversations/topics/19390

        Though Sheri's proposal probably doesn't match your needs in this case, it might be interesting in this context...
         
      • John Shotsky
        Message 3 of 3 , Jun 4, 2014

          Thanks. I am indeed still using Sheri's code, but from reading the help, it seemed to me that there should be a U+XXXX notation that would be understood.

          From the help:

          The sequences \h, \H, \v, and \V are features that were added to Perl at release 5.10. In contrast to the other sequences, which match only ASCII characters by default, these always match certain high-valued codepoints, whether or not PCRE_UCP is set. The horizontal space characters are:

            U+0009     Horizontal tab
            U+0020     Space
            U+00A0     Non-break space
            U+1680     Ogham space mark
            U+180E     Mongolian vowel separator
            U+2000     En quad
            U+2001     Em quad
            U+2002     En space
            U+2003     Em space
            U+2004     Three-per-em space
            U+2005     Four-per-em space
            U+2006     Six-per-em space
            U+2007     Figure space
            U+2008     Punctuation space
            U+2009     Thin space
            U+200A     Hair space
            U+202F     Narrow no-break space
            U+205F     Medium mathematical space
            U+3000     Ideographic space

          The vertical space characters are:

            U+000A     Linefeed
            U+000B     Vertical tab
            U+000C     Form feed
            U+000D     Carriage return
            U+0085     Next line
            U+2028     Line separator
            U+2029     Paragraph separator

          In 8-bit, non-UTF-8 mode, only the characters with codepoints less than 256 are relevant.

          Why document the U+XXXX terminology when it is completely irrelevant to NoteTab? Why not use something that users can at least understand, to help with the terribly inconsistent Unicode character handling in NoteTab? And as I stated, some errors always occur, even with the format I am using. Why do I keep having to revisit this code and write ever more converters to 'clean up' after the others? That is why I started this thread in the first place.

          Does this really mean that when a document is opened in a Unicode page, replacing \h with [space] is always correct? And that \v can be replaced with \r\n to ensure all line ends are the same? I just can't seem to get my arms around this dodgy Unicode handling: it is poorly documented, and the documentation describes things in terms that aren't even part of NoteTab's own terminology.
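
To make the \h and \v question concrete, here is a rough Python sketch (not NoteTab Clip code) that builds the two character sets straight from the lists in the help excerpt and normalizes them. Python's re module has no \h, and its \v means only the vertical tab, so the classes are spelled out by hand; the names HORIZONTAL, VERTICAL and normalize_spacing are made up for illustration. One detail it shows: a Windows line end is two of the "vertical" characters (CR then LF), so the pair has to be matched as a unit, otherwise each half is replaced separately and every line end doubles.

```python
import re

# Code points the help excerpt lists for \h (horizontal space).
HORIZONTAL = (
    "\u0009\u0020\u00A0\u1680\u180E"
    "\u2000\u2001\u2002\u2003\u2004\u2005"
    "\u2006\u2007\u2008\u2009\u200A"
    "\u202F\u205F\u3000"
)
# Code points the help excerpt lists for \v (vertical space).
VERTICAL = "\u000A\u000B\u000C\u000D\u0085\u2028\u2029"

# None of these characters is special inside [...], so no escaping is needed.
H_CLASS = re.compile("[" + HORIZONTAL + "]")
# Try CR+LF first so an existing Windows line end counts as one match.
V_CLASS = re.compile("\r\n|[" + VERTICAL + "]")


def normalize_spacing(text: str) -> str:
    """Replace every listed line-end character with CR+LF and every
    listed horizontal space character with a plain space."""
    text = V_CLASS.sub("\r\n", text)
    return H_CLASS.sub(" ", text)


sample = "1\u00A0cup\u2003flour\r\n2 eggs\n3\u2009tsp salt\u2028done"
print(repr(normalize_spacing(sample)))
```

Note that this also turns tabs and form feeds into spaces and line ends, which may or may not be wanted. In PCRE itself, \R matches CR+LF as a single unit, so it may be a safer thing to search for than \v when the goal is uniform line ends.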

          Ultimately, given the poor Unicode support in NoteTab, there should be a function (feature) that recognizes every character NoteTab cannot understand and converts it to a character entity code which CAN be handled by regex. Thus, a smart double quote could be converted to an HTML character entity in the ASCII range, at which point regular NoteTab clips could look for those sequences and convert them to whatever is wanted, such as standard ASCII double quotes. This is simply a quagmire; the handling never seems to be robust enough for everyday HTML.
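
As an illustration of that idea, here is a rough Python sketch (again, not a NoteTab feature) of a pre-processing pass: every character outside the ASCII range becomes a numeric HTML entity, and a small lookup table then maps chosen entities to plain ASCII. The function names and the REPLACEMENTS table are made up for illustration; only the "xmlcharrefreplace" error handler is part of standard Python.

```python
def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal entity (&#NNNN;)."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical table: which entities to turn into which ASCII text.
REPLACEMENTS = {
    "&#8220;": '"',    # U+201C left double quotation mark
    "&#8221;": '"',    # U+201D right double quotation mark
    "&#8531;": "1/3",  # U+2153 vulgar fraction one third
}


def entities_to_plain(text: str) -> str:
    """Replace the entities listed in REPLACEMENTS; leave all others alone."""
    for entity, plain in REPLACEMENTS.items():
        text = text.replace(entity, plain)
    return text


source = "\u201cAdd \u2153 cup\u201d"
as_entities = to_ascii_entities(source)
print(as_entities)
print(entities_to_plain(as_entities))
```

After the first step the text is pure ASCII, so the entities can be searched for as ordinary strings; anything not in the table stays behind as a visible &#NNNN; sequence instead of turning into a question mark.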

           

          Regards,
          John
          RecipeTools Web Site: http://recipetools.gotdns.com/recipetools/
          John's Mags Yahoo Group:  http://groups.yahoo.com/group/johnsmags/

           

