
Re: Vim on OS X, (no)macatsui problem

  Kenneth Beesley
  Oct 15, 2007

      Great message, as usual.
      I insert some friendly comments below.

      On 13 Oct 2007, at 18:30, Tony Mechelynck wrote:

      > björn wrote:
      >>>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
      >>>> can't reproduce this.
      >>> I got this one wrong. See the other thread for Kenneth's
      >>> clarification. Sorry.
      >> Hi Ken,
      >> I have looked into why MacVim fails to render the deseret glyphs and
      >> I now have an answer, but unfortunately no solution.
      >> The problem is that one deseret character for some reason takes up
      >> _two_ characters when put in the text storage (I guess this has
      >> something to do with Unicode?). Specifically, calling "length" on an
      >> NSString containing one deseret character returns 2 instead of 1, as
      >> I would expect.
      >> Now, I do know how to fix this problem, but since Jiang is working on
      >> moving his drawing code to MacVim I don't really want to spend any
      >> time doing this, since the problem will disappear as soon as he is
      >> finished. I'm sorry about that.
      >> /Björn

      Tony responds:
      > UTF-8 uses:
      > 1 byte for each codepoint in the range U+0000 - U+007F
      > 2 bytes for each codepoint in the range U+0080 - U+07FF
      > 3 bytes for each codepoint in the range U+0800 - U+FFFF
      > 4 bytes for each codepoint in the range U+10000 - U+1FFFFF

      KRB: The current Unicode character set has code point values ranging
      from U+0000 to U+10FFFF, allowing about a million distinct "characters".
      These Unicode characters are slightly abstract and need to be
      distinguished carefully from how they are "encoded" in a file or in a
      programming language. In UTF-8 encoding, the "code unit" is one byte,
      and each Unicode character (each code point value) is stored in one to
      four bytes, as you describe above. The conversion between code point
      values (integers) and the byte representations requires some trivial
      bit extraction and shifting.
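
      For example, here is a tiny Python 3 sketch (purely illustrative; it
      has nothing to do with the Vim or MacVim sources) showing the
      one-to-four-byte range in UTF-8, including a Deseret letter from the
      supplementary area:

        # UTF-8 stores each code point in 1 to 4 bytes, depending on its value.
        for cp in (0x0041, 0x00E9, 0x20AC, 0x10400):  # A, e-acute, euro, Deseret long I
            ch = chr(cp)
            print("U+%04X -> %d byte(s) in UTF-8" % (cp, len(ch.encode("utf-8"))))
        # Prints 1, 2, 3 and 4 bytes respectively.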

      What Björn describes sounds more like UTF-16, where each Unicode
      character (code point value) is stored in either one 16-bit "code unit"
      or in two 16-bit code units (a "surrogate pair"). Characters from the
      Basic Multilingual Plane, U+0000 to U+FFFF, are stored in a single
      16-bit code unit. Supplementary characters, those beyond the Basic
      Multilingual Plane, are stored in two 16-bit code units. (Again there
      is some trivial bit manipulation in the conversion between code point
      values and the bit representations in the 16-bit code units.)
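
      That "trivial bit manipulation" for UTF-16 surrogate pairs looks like
      this in Python 3 (again just an illustrative sketch, using the Deseret
      letter U+10400 as the example):

        cp = 0x10400                  # DESERET CAPITAL LETTER LONG I
        v = cp - 0x10000              # 20-bit value above the BMP
        high = 0xD800 + (v >> 10)     # high (lead) surrogate
        low = 0xDC00 + (v & 0x3FF)    # low (trail) surrogate
        print(hex(high), hex(low))    # prints: 0xd801 0xdc00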

      Perl stores Unicode strings internally as UTF-8, but you should hardly
      ever have to know that. If you ask for the length of a Perl Unicode
      string, Perl gives you the length in Unicode characters. If you loop
      through the characters in a Perl Unicode string, it loops through
      Unicode characters, taking care of the underlying UTF-8 encoding in the
      background. The underlying UTF-8 encoding is effectively hidden from
      the programmer. At the programming level, you can always think of a
      Perl Unicode string as a sequence of Unicode characters (including
      supplementary characters).

      Java from the very beginning took Unicode very seriously. But Java was
      designed in the olden days of Unicode, when code point values ranged
      only from U+0000 to U+FFFF, so every original Unicode character could
      be stored in a single 16-bit "char". The length of a Unicode string was
      simply the number of chars. Easy and clean.
      The introduction of supplementary Unicode characters 10 years ago
      presented quite a challenge for Java and other programming languages
      that wanted to take Unicode seriously. Instead of accommodating the new
      Unicode by making char 32 bits (which would allow each new Unicode
      character to be stored straightforwardly in a single 32-bit char), the
      Java gurus opted to keep "char" at 16 bits and use UTF-16 to store
      Unicode strings. If you ask for the "length" of a Unicode string in
      Java, it still returns the length in chars (16-bit code units) rather
      than the length in Unicode characters. This is (arguably) quite a mess,
      and you have to be very aware of it as a programmer if you want to
      handle supplementary Unicode characters.

      The way that Python handles Unicode strings internally depends on how
      it is configured/built. If configured for "ucs2", Python stores Unicode
      strings as UTF-16, returns the "length" of strings as the number of
      16-bit code units, and if you try to loop through the elements of a
      string, it loops through 16-bit code-unit values, which creates a mess
      if your string contains supplementary characters. This is comparable to
      the situation in Java.

      If you configure Python for "ucs4", then each Unicode string is stored
      internally as a string of 32-bit code units, "length" is returned as
      the number of Unicode characters, and if you loop through the
      characters in a string, you get one Unicode character (code point
      value) at a time, even for supplementary characters. This "ucs4" option
      is now formally termed UTF-32 in Unicode terminology.
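
      A small sketch of the difference (the exact behaviour depends on how
      your Python was built; on Python 3.3 and later the narrow/wide
      distinction is gone and len() always counts Unicode characters):

        s = "\U00010400"         # one Deseret letter
        print(len(s))            # "ucs4"/wide build (and Python 3.3+): 1
                                 # "ucs2"/narrow build: 2
        for ch in s:
            print(hex(ord(ch)))  # wide build: 0x10400
                                 # narrow build: 0xd801 then 0xdc00 (the surrogates)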

      > Actually, current standards mandate that no codepoints higher than
      > U+10FFFD will "ever" be used. (Vim supports up to U+3FFFFFFF, with up
      > to 6 bytes per codepoint, following an earlier draft of the standard.)
      > Unicode also has the notion of "composing characters", which are
      > characters which are "superimposed" on the preceding character,
      > possibly changing its shape. These are usually diacritics: most of
      > the accents of Latin can be either precomposed or
      > spacing-non-accented + composing-accent, but the optional vowel marks
      > of Hebrew and Arabic exist only as composing characters.

      Quite right. "Character" is a technical term in Unicode, and includes
      spaces, punctuation and these Combining Diacritical Marks (block
      starting at U+0300) that might not fall under the everyday notion of
      character. An acute-accented é, for example, can be represented in
      Unicode either as a single character,

        U+00E9

      which has the name LATIN SMALL LETTER E WITH ACUTE.

      You can alternatively represent é as a sequence of two Unicode
      characters,

        U+0065  LATIN SMALL LETTER E
        U+0301  COMBINING ACUTE ACCENT

      The Unicode gods have explicitly decreed that these two representations
      are equivalent, which means that any proper Unicode-capable editor
      should handle and display them equivalently.
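
      You can see the two representations side by side in Python 3 (purely an
      illustration; the code points are the standard ones):

        precomposed = "\u00E9"            # LATIN SMALL LETTER E WITH ACUTE
        decomposed = "e\u0301"            # e + COMBINING ACUTE ACCENT
        print(precomposed == decomposed)  # False: different code point sequences
        print(len(precomposed), len(decomposed))  # prints: 1 2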

      In Hopi (spoken in Arizona) orthography (as defined at the University
      of Arizona), you have some double-accented graphemes like o with both a
      diaeresis and an acute, grave or circumflex accent. In Unicode you can
      represent o with diaeresis and acute (the acute accent is rendered
      above the diaeresis) as either the three-character sequence

        U+006F  LATIN SMALL LETTER O
        U+0308  COMBINING DIAERESIS
        U+0301  COMBINING ACUTE ACCENT

      or as the two-character sequence

        U+00F6  LATIN SMALL LETTER O WITH DIAERESIS
        U+0301  COMBINING ACUTE ACCENT

      But there is no single "pre-composed" Unicode character for this
      grapheme.

      This whole issue of Combining Diacritical Marks is separate from the
      question of encoding (UTF-8, UTF-16 or UTF-32). Some conversion between
      "pre-composed" and "decomposed" representations can be done using the
      "Normalization" routines available in Perl, Python, Java, ICU, etc.
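
      In Python, for instance, those Normalization forms are exposed through
      the standard unicodedata module; a quick sketch:

        import unicodedata

        # NFC composes where possible, NFD decomposes.
        print(unicodedata.normalize("NFC", "e\u0301") == "\u00E9")  # True
        print(unicodedata.normalize("NFD", "\u00E9") == "e\u0301")  # True

        # The Hopi grapheme: NFC can compose no further than o-with-diaeresis,
        # because no single precomposed character exists for the whole grapheme.
        hopi = "o\u0308\u0301"
        print([hex(ord(c)) for c in unicodedata.normalize("NFC", hopi)])
        # prints: ['0xf6', '0x301']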

      These Combining Diacritical Marks need to be rendered above or below,
      or attached in particular places, as appropriate, to any letter
      character. For that to work properly, you need a font (e.g. Doulos SIL
      or Charis SIL) that contains the diacritic-positioning information, and
      you need a sophisticated rendering engine (as in XeTeX) that reads and
      uses that diacritic-positioning information.

      Most software, including text editors, still does a poor job of
      handling Combining Diacritical Marks and supplementary characters in
      general.

      > Since your Deseret characters are outside the BMP, each of them
      > requires 4 bytes in UTF-8 (also two 16-bit words in UTF-16 and one
      > 32-bit doubleword in UTF-32); but maybe that's not what your measured
      > "length" means? Does your NSString include a final null (as C strings
      > do) or an initial bytecount (as Pascal strings do)? Or do your
      > Deseret characters include "composing" elements?

      Because the "length" of each Deseret character is being returned as 2
      rather than 1, it sounds like the MacVim code is using a Java-like
      UTF-16 internal representation for storing Unicode characters
      (including supplementary characters).

      There are no Combining Diacritical Marks required in the traditional
      Deseret Alphabet, per se, although proper rendering software _should_
      allow you to associate one or more Combining Diacritical Marks with any
      letter character and have it rendered acceptably. (Handling combining
      diacritical marks with the Deseret Alphabet is very low priority.)

      Each Deseret Alphabet letter is a single Unicode character, with a
      single code point value in the supplementary area (block starting at
      U+10400). The Shavian alphabet is much the same (in the block starting
      at U+10450). The glyphs are straightforward, rendered left-to-right,
      requiring no ligatures, and could be forced into a fixed-pitch (mono)
      font about as easily as Roman glyphs.
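
      If you have a Python whose unicodedata module carries a reasonably
      current Unicode database, you can poke at those blocks directly
      (illustration only):

        import unicodedata
        print(unicodedata.name("\U00010400"))  # DESERET CAPITAL LETTER LONG I
        print(unicodedata.name("\U00010450"))  # SHAVIAN LETTER PEEP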


      > Best regards,
      > Tony.
      > --
      > hundred-and-one symptoms of being an internet addict:
      > 55. You ask your doctor to implant a gig in your brain.
      > >

      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php