Working with html, xhtml, xml in NoteTab
A great many characters and symbols can appear in the formats named above, but NoteTab only understands the first 256 of them. In addition, HTML itself reserves five of THOSE characters for its markup (left and right angle brackets, quotes, apostrophes and the ampersand), so you can't use them unencoded in your text. W3Schools is a good reference for most things internet, and it will help you understand how characters should be coded when used on the internet or in an ebook.
It is when characters above 255 are used that NoteTab regex will stop working. It has always been that way. It has nothing at all to do with NoteTab; it is the regex engine itself, which has always been a plain-text processor.
For a little information about what it would take to use a Unicode regex engine, check out the following:
One way to isolate this problem is to open the (original) document with Windows notepad, which IS a Unicode editor. You could then replace high-order characters using find/replace sequences. But for THIS document, you would not replace smart quotes with single quotes (apostrophes), because single quotes are used in the html coding itself. So, one would have to use ' to replace the smart quotes, and then do the same for any other characters that are above the first 256.
It is easy to know which characters are outside the basic single-byte set: that set comprises 256 characters, the number that can be expressed in one 8-bit byte (strictly speaking, ASCII itself covers only the first 128 of them). All the rest of the world's characters use at least two bytes, and NoteTab regex does not work on any two-byte characters. The one exception is that it can recognize any form of line termination (line feed, carriage return, or both) using \R. Anything else will not be honored by regex, simply because it works on single 8-bit bytes, one at a time. If you want to work on something above that, you have to recode it somehow. You can change single-character fractions to three-character fractions. You can change all higher-order characters to character entities, which use only characters in the first 256 to encode all other characters.
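As a rough illustration of the recoding idea above, here is a small Python sketch (Python is only a stand-in here, not something NoteTab runs) that rewrites every character above 255 as a numeric character entity and leaves the first 256 alone. The function name to_entities and the sample string are made up for the example.

```python
def to_entities(text: str) -> str:
    """Replace each character above code point 255 with a numeric entity."""
    # Characters 0..255 fit in one byte and are left untouched.
    return "".join(ch if ord(ch) <= 255 else f"&#{ord(ch)};" for ch in text)


# Hypothetical sample: a curly apostrophe (U+2019) and a one-half sign (U+00BD).
sample = "It\u2019s \u00bd done"
encoded = to_entities(sample)
print(encoded)  # It&#8217;s ½ done
```

After a pass like this the file contains only single-byte characters, so a byte-oriented regex should be able to search it again.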
Bottom line, 'plain text' documents code every possible character with only one byte of data. Any two-byte sequence makes the document NOT plain text, and it will not work with PCRE Regex.
I also get two "funny" characters. I believe they are apostrophes with a
curve to them, if you will, based on the context of where they appear.
Secondarily I have hard returns throughout the text as I asked about as
a possibility yesterday.
If I remove the two special characters and then include newlines (?s) it
works as I suggested yesterday. If I don't it won't.
(?s)<span lang=EN-GB style='background:white'>(.*?)</span>
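For anyone who wants to see the effect of (?s) outside NoteTab, here is a minimal Python sketch. Python's re module is not the PCRE engine NoteTab uses, but the (?s) flag means the same thing in both: the dot also matches newlines. The sample HTML string is invented for the demonstration.

```python
import re

# Invented sample: a span whose content contains a hard return.
html = "<span lang=EN-GB style='background:white'>first line\nsecond line</span>"
pattern = r"<span lang=EN-GB style='background:white'>(.*?)</span>"

# Without (?s) the dot stops at the newline, so nothing is found.
without_s = re.search(pattern, html)

# With (?s) the dot crosses the newline and the whole span is captured.
with_s = re.search("(?s)" + pattern, html)

print(without_s)               # None
print(repr(with_s.group(1)))   # 'first line\nsecond line'
```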
I'll say this again, to check your regex open an empty file and type in
Use regex to find an a.
Then find a range: [a-c]
If those are working you know that regex is actually working and your
ntp is likely not corrupt as you suggest. I don't see that you followed
this suggestion, though you may have and I missed the report back.
- John Shotsky wrote:
> It is when the characters above 255 are used that NoteTab regex
> will stop working.
Not necessarily, only since NT introduced its ill-advised UTF "support".
Before that a sequence of several bytes was just that and Regex dealt with
it just like any other sequence. But now what you see, what the status line
says is there, and what you can copy and paste is NOT what's really there
in the text. There already is a command line option to turn this junk off,
but we need a permanent setup entry too.
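
The difference between the bytes that are really in the file and the characters the editor shows can be sketched in Python (an illustration only; it says nothing about how NoteTab stores text internally). A curly apostrophe saved as UTF-8 is three bytes, and a byte-oriented regex treats it as just another sequence:

```python
import re

text = "don\u2019t"            # what you see: 5 characters
raw = text.encode("utf-8")     # what is in the file: 7 bytes

# Byte view: the curly apostrophe is the sequence E2 80 99.
byte_hits = re.findall(rb"[\x80-\xff]+", raw)

# Character view: one single character above 255.
char_hits = re.findall(r"[^\x00-\xff]", text)

print(len(text), len(raw))   # 5 7
print(byte_hits)             # [b'\xe2\x80\x99']
print(char_hits)             # ['’']
```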
- Axel Berger wrote:
>Not necessarily, only since NT introduced its ill-advised UTF "support".
Amen!
>Before that a sequence of several bytes was just that and Regex dealt with
>it just like any other sequence. But now what you see, what the status line
>says is there, and what you can copy and paste is NOT what's really there
>in the text. There already is a command line option to turn this junk off,
>but we need a permanent setup entry too.