
Working with html, xhtml, xml in NoteTab

  • John Shotsky
    Message 1 of 3 , Jan 5, 2014

      There are a great many characters and symbols that appear in the above-mentioned formats, but NoteTab only understands the first 256 of them. In addition, HTML code itself reserves five of THOSE characters for its markup, so you can't use them unencoded in your text (left and right angle brackets, quotes, apostrophes, and the ampersand). W3Schools is a good reference for most things internet, so this link will help you understand how characters should be coded when used on the internet or in an ebook.

      http://www.w3schools.com/tags/ref_entities.asp
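
To illustrate the five reserved characters, here is a minimal sketch in Python (not NoteTab clip code, and not something from the original post). It uses the standard library's `html.escape`; the helper name `encode_reserved` and the sample text are made up for the example.

```python
import html

def encode_reserved(text: str) -> str:
    """Escape the five characters HTML reserves: & < > " and '."""
    # quote=True makes html.escape convert both kinds of quotes as well.
    return html.escape(text, quote=True)

# Hypothetical sample text containing all five reserved characters.
sample = "Tom & Jerry's <b>\"best\"</b>"
encoded = encode_reserved(sample)
print(encoded)
```

Note that `html.escape` writes the apostrophe as the numeric reference `&#x27;` rather than a named entity; either form is understood by browsers.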

      It is when characters above 255 are used that NoteTab regex stops working. It has always been that way. It has nothing at all to do with NoteTab; it is the regex engine itself, which has always been a plain-text processor.

      For a little information about what it would take to use a Unicode regex engine, check out the following:

      http://www.unicode.org/reports/tr18/

      Regards,
      John
      RecipeTools Web Site: http://recipetools.gotdns.com/
      John's Mags Yahoo Group:  http://groups.yahoo.com/group/johnsmags/

       

      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of John Shotsky
      Sent: Sunday, January 05, 2014 09:01
      To: ntb-clips@yahoogroups.com
      Subject: RE: [Clip] RE: About my non-working 'clips', from some months back

       

       

      One way to isolate this problem is to open the (original) document with Windows Notepad, which IS a Unicode editor. You could then replace high-order characters using find/replace sequences. But for THIS document, you would not replace smart quotes with single quotes (apostrophes), because single quotes are used in the HTML coding itself. So one would have to use a character entity such as &rsquo; to replace the smart quotes, and then do the same for any other characters that are above the first 256.
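
As a rough sketch of that recoding step, the following Python snippet (an illustration only, not a NoteTab clip) replaces every character above code point 255 with a numeric character reference. The function name `to_entities` and the sample string are assumptions made for the example.

```python
def to_entities(text: str) -> str:
    """Replace each character above code point 255 with &#NNNN;."""
    # Characters 0-255 pass through untouched; everything else becomes
    # a decimal numeric character reference built from ASCII only.
    return "".join(ch if ord(ch) <= 255 else f"&#{ord(ch)};" for ch in text)

# Hypothetical sample: U+2019 is the curly apostrophe,
# U+201C and U+201D are curly double quotes.
sample = "It\u2019s \u201cfine\u201d"
print(to_entities(sample))
```

The result contains only single-byte characters, so a byte-oriented regex can work on it, and a browser still displays the original curly quotes.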

       

      It is easy to know which characters are outside the basic set: it comprises 256 characters, the number that can be expressed in one 8-bit byte (plain ASCII is the first 128 of them). All the rest of the world's characters use at least two bytes. NoteTab regex does not work on any two-byte characters. The exception is line termination: it can recognize any form of it (line feed, carriage return, or both) using \R. Nothing else beyond single bytes will be honored by regex, simply because it works on single 8-bit bytes at a time. If you want to work on something above that, you have to recode it somehow. You can change single-character fractions to three-character fractions (for example, the one-character one-third symbol to 1/3). You can change all higher-order characters to character entities, which use only characters in the first 256 to encode all other characters.
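
A small Python sketch can make the byte counts and the fraction recoding concrete. It is an illustration under stated assumptions, not NoteTab behavior: the helper names and the `FRACTIONS` table are hypothetical, and the byte counts shown are for UTF-8, where even characters 128 to 255 take two bytes (they fit in one byte only in an 8-bit encoding such as Latin-1).

```python
# Hypothetical mapping from single-character fractions to plain text.
FRACTIONS = {
    "\u00bc": "1/4",  # one quarter
    "\u00bd": "1/2",  # one half
    "\u00be": "3/4",  # three quarters
    "\u2153": "1/3",  # one third
}

def fits_one_byte(ch: str) -> bool:
    """True if the character is within the first 256 code points."""
    return ord(ch) <= 255

def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when stored as UTF-8."""
    return len(ch.encode("utf-8"))

def expand_fractions(text: str) -> str:
    """Rewrite single-character fractions as three-character ones."""
    for symbol, plain in FRACTIONS.items():
        text = text.replace(symbol, plain)
    return text

for ch in ["a", "\u00e9", "\u2019", "\u2153"]:
    print(f"U+{ord(ch):04X} one_byte={fits_one_byte(ch)} utf8_bytes={utf8_len(ch)}")

print(expand_fractions("\u2153 cup"))
```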

      Bottom line, 'plain text' documents code every possible character with only one byte of data. Any two-byte sequence makes the document NOT plain text, and it will not work with PCRE Regex.

       

      Regards,
      John
      RecipeTools Web Site: http://recipetools.gotdns.com/
      John's Mags Yahoo Group:  http://groups.yahoo.com/group/johnsmags/

       

      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of Don
      Sent: Sunday, January 05, 2014 08:19
      To: ntb-clips@yahoogroups.com
      Subject: Re: [Clip] RE: About my non-working 'clips', from some months back

       

       

      I also get two "funny" characters. Based on the context where they
      appear, I believe they are curly apostrophes.

      Secondly, I have hard returns throughout the text, as I asked about as
      a possibility yesterday.

      If I remove the two special characters and then allow for newlines with
      (?s), it works as I suggested yesterday. If I don't, it won't.

      (?s)<span lang=EN-GB style='background:white'>(.*?)</span>
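
For readers who want to see the effect of (?s) outside NoteTab, here is a minimal sketch using Python's `re` module. Python's engine is not the PCRE engine NoteTab uses, but it handles this particular pattern the same way; the sample text is made up and contains a hard return inside the span.

```python
import re

# The pattern from the message above, unchanged.
PATTERN = r"(?s)<span lang=EN-GB style='background:white'>(.*?)</span>"

# Hypothetical sample with a hard return inside the span.
sample = "<span lang=EN-GB style='background:white'>first line\nsecond line</span>"

# With (?s), the dot also matches the newline, so the span is found.
match = re.search(PATTERN, sample)
print(repr(match.group(1)))

# Without (?s), the dot stops at the newline and nothing matches.
no_flag = re.search(PATTERN.replace("(?s)", "", 1), sample)
print(no_flag)
```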

      I'll say this again: to check your regex, open an empty file and type in
      the alphabet.

      Use regex to find an a.

      Then find a range: [a-c]

      If those are working you know that regex is actually working and your
      ntp is likely not corrupt as you suggest. I don't see that you followed
      this suggestion, though you may have and I missed the report back.
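
The same sanity check can be sketched in Python, purely as an illustration of the two tests described above (it says nothing about the state of a NoteTab installation):

```python
import re

# The "empty file with the alphabet typed in".
alphabet = "abcdefghijklmnopqrstuvwxyz"

# Test 1: find a single literal character.
single = re.search("a", alphabet)
print(single.group(0), single.start())

# Test 2: find every character in the range a through c.
in_range = re.findall("[a-c]", alphabet)
print(in_range)
```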

    • Axel Berger
      Message 2 of 3 , Jan 5, 2014
        John Shotsky wrote:
        > It is when the characters above 255 are used that NoteTab regex
        > will stop working.

        Not necessarily, only since NT introduced its ill-advised UTF "support".
        Before that a sequence of several bytes was just that and Regex dealt with
        it just like any other sequence. But now what you see, what the status line
        says is there, and what you can copy and paste is NOT what's really there
        in the text. There already is a command line option to turn this junk off,
        but we need a permanent setup entry too.

        Axel
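
The point about a multi-byte character being "just a sequence of bytes" can be illustrated with a byte-level regex in Python. This is only a sketch of the general idea, assuming UTF-8 encoded text; it does not reproduce NoteTab's behavior or its command line option.

```python
import re

# Hypothetical text with a curly apostrophe (U+2019), stored as UTF-8.
# In UTF-8 that one character becomes the three bytes E2 80 99.
data = "It\u2019s here".encode("utf-8")

# Matching the raw byte sequence literally works like any other sequence.
literal = re.search(rb"It\xe2\x80\x99s", data)
print(literal is not None)

# In byte mode a dot stands for one byte, so a single dot cannot
# cover the three-byte apostrophe, but three dots can.
one_dot = re.search(rb"It.s", data)
three_dots = re.search(rb"It.{3}s", data)
print(one_dot is None, three_dots is not None)
```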
      • loro
        Message 3 of 3 , Jan 6, 2014
          Axel Berger wrote:
          >Not necessarily, only since NT introduced its ill-advised UTF "support".
          >Before that a sequence of several bytes was just that and Regex dealt with
          >it just like any other sequence. But now what you see, what the status line
          >says is there, and what you can copy and paste is NOT what's really there
          >in the text. There already is a command line option to turn this junk off,
          >but we need a permanent setup entry too.

          Amen!

          Lotta