Unicode to ANSI Conversion

  • Art Kocsis
    Message 1 of 3 , Jan 12, 2012
      I guess there is something (only one???) I don't understand about how NTB
      works.

      I tried loading a Unicode file with the View | Options | General | Protect
      Unicode files option unchecked, but the saved file still consists of
      dual-byte characters with the high byte equal to zero. View | Options |
      Documents | Save As is set to DOS/Windows. I had expected that, at least
      under these conditions, an edited file would follow my settings and be
      saved as ANSI, but it isn't.

      Following Axel's code I thought I could just do a simple substitution
      conversion and get rid of the high byte. I loaded the file with the Protect
      Unicode set to on and then unset the "read only" flag but nothing is matched:

      ^!Replace "[\x00-\x00]([\x00-\xFF])" >> "$1" ARSW
      ^!Replace "[\x00]([\x00-\xFF])" >> "$1" ARSW
      ^!Replace "\x00([\x00-\xFF])" >> "$1" ARSW

      The source file contains:
      000 FF FE 57 00 69 00 6E 00 64 00 6F 00 77 00 73 00 ¦W·i·n·d·o·w·s·
      010 20 00 52 00 65 00 67 00 69 00 73 00 74 00 72 00 ·R·e·g·i·s·t·r·

      Although the disk image is in little endian format, the Find should have
      found something in the memory image in either little or big endian format.
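For comparison outside NoteTab, the intended conversion can be sketched in Python. This is only an illustration under stated assumptions: the sample bytes are the first row of the dump above, "ANSI" is assumed to mean the Windows-1252 code page, and the helper name utf16_to_ansi is made up for this example.

```python
import codecs

def utf16_to_ansi(raw: bytes, ansi_codec: str = "cp1252") -> bytes:
    """Decode UTF-16 data and re-encode it as single-byte ANSI.

    Characters with no equivalent in the target code page become '?'.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[2:].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[2:].decode("utf-16-be")
    else:
        # Assumption: data without a byte order mark is little endian.
        text = raw.decode("utf-16-le")
    return text.encode(ansi_codec, errors="replace")

# First row of the hex dump; FF FE is the little-endian byte order mark.
sample = bytes.fromhex("FFFE570069006E0064006F0077007300")
print(utf16_to_ansi(sample))  # b'Windows'
```

Decoding and re-encoding, rather than deleting every zero byte, also copes with characters whose high byte is not zero.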

      The help file is a bit ambiguous. It states that

      "When NoteTab opens files, it has to convert them to the ANSI
      character set, which handles much fewer characters. ... When
      [the Protect Unicode] setting is checked, files are opened in
      Read-Only mode, which protects the file from changes."

      Does this mean that ALL (dual byte) Unicode files are loaded as (single
      byte) ANSI files irrespective of the option settings and all the Protect
      option does is set the "read-Only" flag? If so, how does Axel's clip work
      as it depends on a dual byte format?

      Picture me confused,

      Art
    • Axel Berger
      Message 2 of 3 , Jan 13, 2012
        Art Kocsis wrote:
        > Just to make sure we are on the same page I am defining
        > a Unicode document as one that uses two bytes to encode
        > each character. Most of the time (for western documents),
        > the high byte is zero.

        Yes, you're right, that's UTF-16 and it's NOT what I'm dealing with here
        (and I believe nothing NoteTab can deal with well at all).

        What I'm writing for and about is UTF-8, the "up and coming" standard
        for the web. The 128 ASCII characters, which make up the bulk of all
        Western European text (i.e. except Greek and Cyrillic), are written as
        is, which more or less halves the size of texts compared with UTF-16.
        All others are coded in two to four bytes (the original design allowed
        up to six). It is this variability that creates the problem. All bytes
        with the high bit set followed by a zero bit, i.e. 10xxxxxx or $80 to
        $BF, are continuation bytes and must not begin a sequence. A start byte
        has as many high bits set as the sequence is long, followed by a zero
        bit. The first 128 legal two-byte sequences (you could code the 128
        7-bit characters as two-byte sequences, but that's declared illegal),
        i.e. those with $C2 or $C3 in the first byte (1100001x 10xxxxxx), code
        the upper half of Latin-1.
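The byte rules above can be checked with a short Python sketch. It is only an illustration: the function name utf8_byte_kind and its labels are invented for this example, and the limits follow the current four-byte definition of UTF-8.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single byte by its leading bits."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: plain ASCII
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: $80 to $BF, never starts a sequence
    if b < 0xC2:
        return "invalid"       # $C0/$C1 would re-code 7-bit characters
    if b < 0xE0:
        return "start of 2"    # 110xxxxx
    if b < 0xF0:
        return "start of 3"    # 1110xxxx
    if b < 0xF5:
        return "start of 4"    # 11110xxx, capped at U+10FFFF
    return "invalid"

# U+00E9 (e with acute accent) is a Latin-1 character.
encoded = "\u00e9".encode("utf-8")
print([hex(b) for b in encoded])             # ['0xc3', '0xa9']
print([utf8_byte_kind(b) for b in encoded])  # ['start of 2', 'continuation']

# Every character in the upper half of Latin-1 starts with $C2 or $C3.
first_bytes = {chr(cp).encode("utf-8")[0] for cp in range(0x80, 0x100)}
print(sorted(hex(b) for b in first_bytes))   # ['0xc2', '0xc3']
```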

        UTF-16 is not used much and, due to its almost constant sequence length
        of two bytes (only characters outside the Basic Multilingual Plane need
        four), is easy to deal with. UTF-8 breaks the rule of one character
        mapping to a fixed number of bytes, mostly one.
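The size difference is easy to see in Python. The sample strings and the helper byte_lengths below are arbitrary choices for this illustration, not anything taken from NoteTab.

```python
ascii_text = "Windows Registry"
mixed_text = "na\u00efve \u20ac"  # contains i-diaeresis and the euro sign

def byte_lengths(text: str, encoding: str) -> list:
    """Return the encoded size in bytes of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

# Pure ASCII text: UTF-8 needs half the space of UTF-16.
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-16-le")))  # 16 32

# Mixed text: UTF-8 lengths vary per character, UTF-16 stays at two bytes here.
print(byte_lengths(mixed_text, "utf-8"))      # [1, 1, 2, 1, 1, 1, 3]
print(byte_lengths(mixed_text, "utf-16-le"))  # [2, 2, 2, 2, 2, 2, 2]
```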

        Axel
      • Art Kocsis
        Message 3 of 3 , Jan 13, 2012
          Hi Axel,

          Thanks for the feedback. I am not crazy after all, only RAM deficient.
          But at least my senior citizen status gives me an excuse.<g>
          I researched char codes and encoding many years ago but forgot
          all about it and skipped right over the UTF-8 in the big letters of your
          title. Duh!!!

          Although UTF-16 may not be common in your world, it is the Windows
          RegEdit export format which is a major source of my files. Other than
          a few infrequent e-mails I don't see UTF-8 (UTF-8 in the browsers is
          transparent for me.) I will just have to get around to coding up the clip
          to do the copy and save as approach.

          BTW, did you try the BabelPad editor I posted yesterday? If anything
          could handle UTF-8 it would be his app. I doubt there is anyone in the
          world that knows more about character coding or character history
          (especially Asian glyphs), than Andrew West.

          Art

