  • Axel Berger
    Oct 10, 2012
      Marcelo Bastos wrote:
      > The problem: if there were Unicode characters there, you lost them.

      Which is why that's not the way to do it. I hope the following is correct
      (i.e. works first time); I really hate this "feature". You can either
      a) open the file with the codepage "UTF-8 (no conversion)" and possibly also
      switch off document --> Read only, or
      b) open an empty document and your page in another editor and copy and
      paste all of it over.

      To get rid of the UTF-8 characters and convert them to HTML entities you
      can run this clip:

      ^!Find "[\xC0-\xF7][\x80-\xBF]*" RS
      ^!IfError donelatin
      ^!IfMatch "[\xC2-\xC3][\x80-\xBF]" "^$GetSelection$" latin1
      ^!IfMatch "[\xC0-\xDF][\x80-\xBF]" "^$GetSelection$" zwei
      ^!IfMatch "[\xE0-\xEF][\x80-\xBF]{2}" "^$GetSelection$" drei
      ^!IfMatch "[\xF0-\xF7][\x80-\xBF]{3}" "^$GetSelection$" vier
      ^!Continue Illegal sequence, can't be converted.
      ^!Goto loop
      ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD
      ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD
      ^!Set %third%=0
      ^!Set %fourth%=0
      ^!Goto makeent
      ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD
      ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD
      ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD
      ^!Set %fourth%=0
      ^!Goto makeent
      ^!Set %first%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";4)$)$ MOD
      ^!Set %second%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";3)$)$ MOD
      ^!Set %third%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";2)$)$ MOD
      ^!Set %fourth%=^$Calc(^$CharToDec(^$StrIndex("^$GetSelection$";1)$)$ MOD
      ^!InsertText &#^%first%;
      ^!Goto loop
      ^!Set %first%=^$StrCopyRight("^$GetSelection$";1)$
      ^!Set %second%=^$StrCopyLeft("^$GetSelection$";1)$
      ^!Set %first%=^$Calc(^$CharToDec(^%first%)$ MOD 64)$
      ^!Set %second%=^$Calc(^$CharToDec(^%second%)$ MOD 4)$
      ^!InsertText ^$DecToChar(^$Calc(64*^%second%+^%first%)$)$
      ^!Goto loop
      ^!Replace "€" >> "€" WASTI
      ^!Replace "Š" >> "Š" WASTI
      ^!Replace "š" >> "š" WASTI
      ^!Replace "Ž" >> "Ž" WASTI
      ^!Replace "ž" >> "ž" WASTI
      ^!Replace "Œ" >> "Œ" WASTI
      ^!Replace "œ" >> "œ" WASTI
      ^!Replace "Ÿ" >> "Ÿ" WASTI

      Beware of long lines that got broken up: every line of the clip begins
      with either "^" or ":".
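
      For readers who want to check the arithmetic outside NoteTab, here is a
      minimal Python sketch of the same idea: find each multi-byte UTF-8
      sequence, strip the marker bits with MOD, and write the result as a
      numeric entity. The function names are made up for illustration, and the
      lead-byte divisors 32, 16 and 8 come from the UTF-8 format itself, not
      from the clip (whose long lines are cut off above). Unlike the clip, the
      sketch does not treat the Latin-1 range separately; it always emits
      numeric entities.

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Return the code point encoded by one 2-, 3- or 4-byte UTF-8 sequence."""
    # Payload of the lead byte: 5 bits (MOD 32), 4 bits (MOD 16), 3 bits (MOD 8).
    lead_mod = {2: 32, 3: 16, 4: 8}
    if len(seq) not in lead_mod:
        raise ValueError("expected a 2-, 3- or 4-byte sequence")
    code_point = seq[0] % lead_mod[len(seq)]
    for byte in seq[1:]:
        if not 0x80 <= byte <= 0xBF:
            raise ValueError("illegal continuation byte")
        # Each continuation byte carries 6 payload bits (MOD 64).
        code_point = code_point * 64 + byte % 64
    return code_point


def utf8_to_entities(raw: bytes) -> str:
    """Replace every multi-byte UTF-8 sequence in raw with &#NNN;."""
    out = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:              # plain ASCII, copy unchanged
            out.append(chr(lead))
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF7:
            length = 4
        else:
            raise ValueError("illegal sequence, can't be converted")
        seq = raw[i:i + length]
        if len(seq) < length:
            raise ValueError("truncated sequence at end of input")
        out.append("&#%d;" % decode_utf8_sequence(seq))
        i += length
    return "".join(out)
```

      If Python is at hand anyway, the built-in
      text.encode("ascii", "xmlcharrefreplace") produces the same kind of
      numeric entities from an already decoded string.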


      > then have to figure out what they were and where they went originally.
      > And then you have to find out the character entities for them and enter
      > them manually.
      > One way to do that, I found, is by using Microsoft Word. Open the
      > original file in Word, save it as "Web page, filtered." Word is pretty
      > useless as a HTML editor, but it does have good Unicode support, and it
      > will usually convert Unicode to a Win-1252 file with all the
      > 1252-incompatible characters turned into HTML numeric entities. Then you open
      > this file in Notepad, search for "&#", and there you have it, the
      > mystery characters.
      > And that is the second reason I still keep Word on my computer, since I
      > hardly ever use it for writing nowadays. (The first reason is that the
      > file-compare feature in Word is pretty kickass, and I have to compare
      > files now and then).
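
      The "search for &#" step from the quoted text can also be scripted. This
      is only a sketch under the assumption that the filtered page has already
      been read into a string; list_numeric_entities is a hypothetical helper
      name, not something from Word or NoteTab.

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#x20AC;) numeric entities.
_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def list_numeric_entities(text):
    """Return (entity, character) pairs in the order they appear in text."""
    found = []
    for match in _ENTITY.finditer(text):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        found.append((match.group(0), chr(code_point)))
    return found
```

      Printing the returned pairs shows at a glance which "mystery characters"
      the file contains and where to look for them.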
