
Re: [Clip] Unicode to ANSI Conversion

  • Art Kocsis
    Message 1 of 3 , Jan 13, 2012
      Hi Axel,

      Thanks for the feedback. I am not crazy after all, only RAM deficient.
      But at least my senior citizen status gives me an excuse.<g>
      I researched char codes and encoding many years ago but forgot
      all about it and skipped right over the UTF-8 in the big letters of your
      title. Duh!!!

      Although UTF-16 may not be common in your world, it is the Windows
      RegEdit export format, which is a major source of my files. Other than
      a few infrequent e-mails I don't see UTF-8 (UTF-8 in the browsers is
      transparent for me). I will just have to get around to coding up the
      clip to do the copy-and-save-as approach.

      BTW, did you try the BabelPad editor I posted yesterday? If anything
      could handle UTF-8 it would be his app. I doubt there is anyone in the
      world who knows more about character coding or character history
      (especially Asian glyphs) than Andrew West.

      Art


      At 1/13/2012 01:42 PM, you wrote:
      >Art Kocsis wrote:
      > > Just to make sure we are on the same page I am defining
      > > a Unicode document as one that uses two bytes to encode
      > > each character. Most of the time (for western documents),
      > > the high byte is zero.
      >
      >Yes, you're right, that's UTF-16 and it's NOT what I'm dealing with here
      >(and I believe nothing NoteTab can deal with well at all).
      >
      >What I'm writing for and about is UTF-8, the "up and coming" standard
      >for the web. The 128 low ASCII characters, the bulk of all Western
      >European text (i.e. except Greek and Cyrillic), are written as is, which
      >more or less halves the size of texts written in UTF-16. All others are
      >coded in two to six bytes (current standards allow at most four). It is
      >this variability that creates the problem. All bytes of the form
      >10xxxxxx ($80 to $BF), i.e. the high bit set followed by a zero bit, are
      >following bytes and must not begin a sequence. A start byte has as many
      >high bits set as the length of the sequence, followed by a zero bit. The
      >first 128 legal two-byte sequences (you could code the 128 7-bit ones as
      >two-byte, but that's declared illegal), i.e. $C2 or $C3 in the first
      >byte (1100001x 10xxxxxx), code the upper half of Latin-1.
      >
      >UTF-16 is not used much and, due to the nearly constant sequence length
      >of two bytes (only characters beyond the Basic Multilingual Plane take
      >four), is easy to deal with. UTF-8 breaks the rule of one character
      >mapping to a fixed number of bytes, even though most take just one.
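
      The byte rules quoted above can be sketched in a few lines. This is
      only an illustration in Python, not a NoteTab clip: the function name
      utf8_sequence_length is made up for this example, and it assumes the
      current four-byte limit rather than the original six-byte scheme.

```python
def utf8_sequence_length(b: int) -> int:
    """Return the sequence length announced by byte b (0..255),
    or 0 if b must not begin a sequence. Hypothetical helper."""
    if b < 0x80:
        return 1      # 0xxxxxxx: low ASCII, written as is
    if b < 0xC2 or b > 0xF4:
        # $80..$BF are following bytes, $C0/$C1 would re-encode 7-bit
        # characters as two bytes, and anything above $F4 exceeds the
        # four-byte limit assumed here.
        return 0
    length = 0
    mask = 0x80
    while b & mask:   # count the leading 1 bits of the start byte
        length += 1
        mask >>= 1
    return length


# "é" is a Latin-1 character, so its first byte is $C3.
encoded = "é".encode("utf-8")
print([hex(x) for x in encoded])                   # ['0xc3', '0xa9']
print([utf8_sequence_length(x) for x in encoded])  # [2, 0]
```

      A converter can use the returned length to decide how many following
      bytes to consume before mapping the character to its ANSI equivalent.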