24305 RE: [Clip] Re: cleaning extra spaces
- Jan 18, 2014
I misspoke about non-break spaces being Unicode. While they are Unicode characters, they are also Ansi characters, which work just fine in NoteTab but must be explicitly called out, as I showed in the revised clip below. If you have characters above the Ansi range (0-255), then you have a whole other problem to solve. It is one I have already solved, but it takes a potful of code to deal with full Unicode in NoteTab's regex. My code either converts high-order characters to Ansi characters or simply omits them, thus removing, for example, Asian characters from the text entirely.
\x20 is the written code for a space. Much easier to understand in email, and NoteTab accepts either.
As to replacing CR's after a '>', it should not have. The clip says it cannot stop on either a space or a >. So, the CR cannot be preceded by either spaces or > before the \K which simply means don't capture anything before the \K. It provides a stop point from which to proceed. Any character can be before the \K except a space or >. Following the \K, any combination of spaces with at least one CR will trigger the replacement. The replacement is one space. If it removed the CR's, it added the spaces. I can't imagine any 'problem' in which it removed the CR's and didn't insert the spaces - they have to be there or it would not have triggered. When you say they 'reappeared', I am confused, because nothing can 'reappear' that is not already there. If you are using NoteTab Pro, you can see the spaces as dots.
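For anyone who wants to try the logic outside NoteTab, here is a rough Python sketch of the same idea. It is an approximation, not the clip itself: Python's re module has neither \K nor \R, so I am assuming a lookbehind in place of "[^>\x20]\K" and a spelled-out alternation in place of \R. The name join_lines is made up for illustration.

```python
import re

# Approximation of the clip's pattern "[^>\x20]\K\x20*\R+\x20*".
# Assumptions: (?<=...) stands in for "...\K", and (?:\r\n|\r|\n)
# stands in for \R, because Python's re supports neither escape.
PATTERN = re.compile(r"(?<=[^>\x20])\x20*(?:\r\n|\r|\n)+\x20*")


def join_lines(text: str) -> str:
    """Replace each run of line breaks, plus any spaces around it,
    with a single space, unless the run directly follows '>' or a space."""
    return PATTERN.sub(" ", text)


# The samples use bare "\n" line endings. With "\r\n" input, the "\r"
# itself satisfies [^>\x20] in this Python version, so results can differ.
sample = "<p>one  \n  two</p>\n<p>three</p>"
print(repr(join_lines(sample)))
```

The break inside the first paragraph is replaced by one space, while the break after the closing '>' is left alone because the lookbehind refuses to start there.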
The only thing I can think of that could cause any inconsistency is if they are not actually spaces, but non-break spaces, which are Unicode characters that would not function in this clip as is. The way around that is to convert them to spaces first, or include them in each location in which there are spaces above.
To convert them first:
^!Replace "\xA0" >> "\x20" AIRSW
^!IfError Next Else Skip_-1
After this, all spaces are normal spaces, and the code provided will run as expected. Otherwise, add the non-break space (\xA0) to the command as follows:
^!Replace "[^>\x20\xA0]\K[\x20\xA0]*\R+[\x20\xA0]*" >> "\x20" ARSW
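The two approaches above (convert the non-break spaces first, or allow them wherever a space is allowed) can be compared in a small Python sketch. Same caveats as any translation: this is not NoteTab code, the lookbehind and the explicit line-break alternation are my stand-ins for \K and \R, and the function names are hypothetical.

```python
import re

NBSP = "\xa0"  # non-break space, character 160 in the Ansi range

# Space-only pattern, used after the non-break spaces have been converted.
PLAIN = re.compile(r"(?<=[^>\x20])\x20*(?:\r\n|\r|\n)+\x20*")

# Widened pattern: \xa0 is accepted everywhere \x20 is.
WIDE = re.compile(r"(?<=[^>\x20\xa0])[\x20\xa0]*(?:\r\n|\r|\n)+[\x20\xa0]*")


def normalize_nbsp(text: str) -> str:
    """Approach 1, step 1: turn every non-break space into a normal space."""
    return text.replace(NBSP, " ")


def join_lines_convert_first(text: str) -> str:
    """Approach 1: convert non-break spaces, then run the space-only pattern."""
    return PLAIN.sub(" ", normalize_nbsp(text))


def join_lines_nbsp(text: str) -> str:
    """Approach 2: leave the text alone and use the widened pattern."""
    return WIDE.sub(" ", text)


sample = "one" + NBSP + "\n" + NBSP + "two>\nthree"
print(repr(join_lines_convert_first(sample)))
print(repr(join_lines_nbsp(sample)))
```

On this sample both routes give the same text. They are not identical in general: the first converts every non-break space in the document, while the second only touches those sitting next to a line break.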
I would have to see the original html to know what is happening for sure, but if spaces in html are not acting as expected, there is a good chance they are non-breaking spaces, which have no shortcut in regex that doesn't also include the \R.
You guys are magnificent. John even spotted something I knew but forgot: that I want to replace the carriage returns with a space rather than simply suppress it. And yes, you are right, the email reformatted the text so that it appeared as it should.
The code you gave me: ^!Replace "[^>\x20]\K\x20*\R+\x20*" >> "\x20" ARSW worked with one problem. The spaces vanished but so did all the line breaks. When I ran it through a web editor, however, these reappeared. It did its own beautifying, and when I reopened the saved text in NoteTab all was fine.
I can live with this two-step approach, if there is no alternative, but I am puzzled why [^`>] still replaced the linebreaks after >, and I don't really understand \K or should it be \K\20.