  • Axel Berger
    Oct 10, 2012
      Marcelo Bastos wrote:
      > I had a quick look at the logic, and it seems to be generic enough to
      > tackle the entire Basic Multilingual Plane.

      Even more than that, it will also translate illegal UTF-8 into equally
      illegal entities. I have another clip that checks a document for legal
      UTF-8 and flags errors such as stray ANSI (single-byte) characters.

      :loop
      ^!Find "([\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*)" RS
      ^!IfError usasc
      ^!IfMatch "[\xC2-\xDF][\x80-\xBF]" "^$GetSelection$" loop
      ^!IfMatch "\xE0[\xA0-\xBF][\x80-\xBF]" "^$GetSelection$" loop
      ^!IfMatch "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" "^$GetSelection$" loop
      ^!IfMatch "\xED[\x80-\x9F][\x80-\xBF]" "^$GetSelection$" loop
      ^!IfMatch "\xF0[\x90-\xBF][\x80-\xBF]{2}" "^$GetSelection$" loop
      ^!IfMatch "[\xF1-\xF3][\x80-\xBF]{3}" "^$GetSelection$" loop
      ^!IfMatch "\xF4[\x80-\x8F][\x80-\xBF]{2}" "^$GetSelection$" loop
      ^!Continue Illegal sequence, not valid UTF-8
      ^!Goto loop
      :usasc
      ^!Continue No errors found
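
For anyone who wants to try the same byte ranges outside NoteTab, here is a minimal Python sketch. It is not part of the clip: the names `CANDIDATE`, `LEGAL_SEQUENCE` and `find_illegal_sequences` are made up for illustration, and it assumes each candidate run must match one of the legal patterns in full, which may be stricter than what `^!IfMatch` does.

```python
import re

# Same candidate pattern as the clip's ^!Find line: a stray continuation
# byte, or a lead byte followed by any number of continuation bytes.
CANDIDATE = re.compile(rb"[\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*")

# Same byte ranges as the clip's ^!IfMatch lines, one alternative per line.
LEGAL_SEQUENCE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
)


def find_illegal_sequences(data: bytes):
    """Return (offset, raw bytes) for each non-ASCII run that is not legal UTF-8."""
    return [
        (m.start(), m.group())
        for m in CANDIDATE.finditer(data)
        # Assumption: the whole run must match exactly one legal pattern.
        if not LEGAL_SEQUENCE.fullmatch(m.group())
    ]
```

Calling `find_illegal_sequences(open("page.html", "rb").read())` (the file name is only a placeholder) returns an empty list for a clean UTF-8 file; an ANSI-encoded "é" shows up as a lone `\xe9` byte with its offset.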

