Re: [Clip] Unicode to ANSI Conversion
Hi Axel,
Thanks for the feedback. I am not crazy after all, only RAM deficient.
But at least my senior citizen status gives me an excuse.<g>
I researched char codes and encoding many years ago but forgot
all about it and skipped right over the UTF-8 in the big letters of your
Although UTF-16 may not be common in your world, it is the Windows
RegEdit export format, which is a major source of my files. Other than
a few infrequent e-mails, I don't see UTF-8 (UTF-8 in the browsers is
transparent to me). I will just have to get around to coding up the clip
to do the copy and "Save As" approach.
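For comparison, here is a minimal sketch of the same UTF-16 to ANSI conversion done outside NoteTab, in Python. It assumes the input starts with a byte order mark (as RegEdit exports do) and that "ANSI" means Windows code page 1252; the function name `utf16_to_ansi` is made up for illustration.

```python
def utf16_to_ansi(data: bytes, ansi_codec: str = "cp1252") -> bytes:
    """Decode UTF-16 bytes and re-encode them in a single-byte ANSI code page."""
    # The "utf-16" codec reads the byte order mark to pick the byte order
    # and drops it from the decoded text.
    text = data.decode("utf-16")
    # Characters with no equivalent in the target code page become "?".
    return text.encode(ansi_codec, errors="replace")


if __name__ == "__main__":
    # Hypothetical sample standing in for the first line of a .reg export.
    sample = "Windows Registry Editor Version 5.00\r\n".encode("utf-16")
    print(utf16_to_ansi(sample))
```

Anything outside the target code page is lost in this direction, so it is only safe for files that really are "western" text.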
BTW, did you try the BabelPad editor I posted yesterday? If anything
could handle UTF-8 it would be his app. I doubt there is anyone in the
world who knows more about character coding or character history
(especially Asian glyphs) than Andrew West.
At 1/13/2012 01:42 PM, you wrote:
>Art Kocsis wrote:
> > Just to make sure we are on the same page, I am defining
> > a Unicode document as one that uses two bytes to encode
> > each character. Most of the time (for western documents),
> > the high byte is zero.
>Yes, you're right, that's UTF-16 and it's NOT what I'm dealing with here
>(and, I believe, not something NoteTab can deal with well at all).
>What I'm writing for and about is UTF-8, the "up and coming" standard
>for the web. The 128 ASCII characters, which make up the bulk of all
>Western European text (i.e. except Greek and Cyrillic), are written as
>is, one byte each, which more or less halves the size of the same text
>in UTF-16. All others are coded in two to four bytes (the original
>design allowed up to six). It is this variability that creates the
>problem. Every byte of the form 10xxxxxx (high bit set, followed by a
>zero bit), i.e. $80 to $BF, is a continuation byte and must not begin a
>sequence. A start byte has as many high one-bits as the sequence has
>bytes, followed by a zero bit. The first 128 legal two-byte sequences
>(you could code the 128 7-bit characters as two-byte sequences, but
>that's declared illegal), i.e. those with $C2 or $C3 in the first byte
>(1100001x 10xxxxxx), encode the upper half of Latin-1.
>UTF-16 is not used much and, since it takes a constant two bytes per
>character (apart from the rarely seen surrogate pairs), is easy to deal
>with. UTF-8 breaks the rule of one character mapping to a fixed number
>of bytes (usually one).
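The byte rules quoted above can be sketched in a few lines of Python. This is only an illustration under the current four-byte limit, not a full validator; the names `classify_byte` and `decode_two_byte` are invented for the example.

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0..255) by the role it can play in UTF-8."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"   # 10xxxxxx: $80..$BF, never starts a sequence
    if b < 0xC2:
        return "invalid"        # $C0/$C1 would re-encode 7-bit characters
    if b < 0xE0:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if b < 0xF0:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    if b < 0xF5:
        return "lead-4"         # 11110xxx: starts a four-byte sequence
    return "invalid"            # $F5..$FF are not used in current UTF-8


def decode_two_byte(lead: int, cont: int) -> int:
    """Combine a two-byte sequence into its code point."""
    if classify_byte(lead) != "lead-2" or classify_byte(cont) != "continuation":
        raise ValueError("not a valid two-byte UTF-8 sequence")
    # Five payload bits from the lead byte, six from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


if __name__ == "__main__":
    # Every code point from $80 to $FF encodes with $C2 or $C3 as lead byte.
    leads = {chr(cp).encode("utf-8")[0] for cp in range(0x80, 0x100)}
    print(sorted(hex(b) for b in leads))
    print(hex(decode_two_byte(0xC3, 0xA9)))
```

For text that only contains ASCII and these $C2/$C3 sequences, the result of `decode_two_byte` is the same number as the character's ISO 8859-1 byte, which is what makes a UTF-8 to Latin-1 conversion feasible.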