
Re: Vim on OS X, (no)macatsui problem

  • Tony Mechelynck
    Oct 16, 2007
      Kenneth Beesley wrote:
      >
      > Tony,
      >
      > Great message, as usual.
      > I insert some friendly comments below.
      >
      > On 13 Oct 2007, at 18:30, Tony Mechelynck wrote:
      >
      >> björn wrote:
      >>>>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
      >>>>> can't reproduce this.
      >>>> I got this one wrong. See the other thread for Kenneth's
      >>>> clarification. Sorry.
      >>> Hi Ken,
      >>>
      >>> I have looked into why MacVim fails to render the Deseret glyphs and I
      >>> now have an answer, but unfortunately no solution.
      >>>
      >>> The problem is that one Deseret character for some reason takes up
      >>> _two_ characters when put in the text storage (I guess this has
      >>> something to do with Unicode?). Specifically, calling "length" on an
      >>> NSString containing one Deseret character returns 2 instead of 1, as I
      >>> would expect.
      >>>
      >>> Now, I do know how to fix this problem, but since Jiang is working on
      >>> moving his drawing code to MacVim I don't really want to spend any
      >>> time doing this, since the problem will disappear as soon as he is
      >>> finished. I'm sorry about that.
      >>>
      >>>
      >>> /Björn
      >
      > Tony responds:
      >> UTF-8 uses:
      >> 1 byte for each codepoint in the range U+0000 - U+007F
      >> 2 bytes for each codepoint in the range U+0080 - U+07FF
      >> 3 bytes for each codepoint in the range U+0800 - U+FFFF
      >> 4 bytes for each codepoint in the range U+10000 - U+1FFFFF
      >
      > KRB: The current modern Unicode character set has code point
      > values ranging from U+0 to U+10FFFF, allowing about a million
      > distinct "characters". These Unicode Characters are slightly abstract
      > and need to be distinguished carefully from how they are "encoded"
      > in a file or in a programming language. In UTF-8 encoding, the "code
      > unit" is one byte, and each Unicode character (each code point value)
      > is stored in one to four bytes as you describe above. The conversion
      > between code point values (integers) and the bit/byte representations
      > requires some trivial bit extraction and shifting.
      >
      > What Björn describes sounds more like UTF-16, where each Unicode
      > character (code point value) is stored in either one 16-bit "code unit"
      > or in two 16-bit code units. Characters from the Basic Multilingual
      > Plane, U+0 to U+FFFF, are stored in a single 16-bit code unit.
      > Supplementary characters, those beyond the Basic Multilingual Plane,
      > are stored in two 16-bit code units. (Again there is some trivial bit
      > manipulation involved in conversion between code point values and the
      > bit representations in the 16-bit code units.)
      >
      > Perl stores Unicode strings internally as UTF-8, but you should hardly
      > ever have to know that. If you ask for the length of a Perl Unicode
      > string, Perl gives you the length in Unicode Characters. If you loop
      > through the characters in a Perl Unicode string, it loops through
      > Unicode Characters, taking care of the underlying UTF-8 encoding in the
      > background. The underlying encoding in UTF-8 is effectively hidden from
      > the programmer. At the programming level, you can always think of a
      > Perl Unicode string as a sequence of Unicode Characters (including
      > supplementary characters).
      >
      > Java from the very beginning took Unicode very seriously. But Java
      > emerged in the olden days of Unicode, when code point values ranged
      > only from U+0 to U+FFFF, so every original Unicode character could be
      > stored in a single 16-bit "char". The length of a Unicode string was
      > simply the number of chars. Easy and clean.
      > The introduction of supplementary Unicode characters 10 years ago
      > created quite a challenge for Java and other programming languages that
      > wanted to take Unicode seriously. Instead of accommodating the New
      > Unicode by making char 32 bits (which would allow each New Unicode
      > character to be stored straightforwardly in a single 32-bit char), the
      > Java gurus opted to keep "char" at 16 bits and use UTF-16 to store
      > Unicode strings. If you ask for the "length" of a Unicode string in
      > Java, it still returns the length in chars rather than the length in
      > Unicode Characters. This is (arguably) quite a mess, and you have to be
      > very aware of it as a programmer if you want to handle Supplementary
      > Unicode Characters.

      Hm. I guess I'll stay with Vim and vim-script, where I know what to expect.
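
The "trivial bit manipulation" Ken mentions for UTF-16 can be sketched as follows. This is a minimal illustration in Python 3, not code from MacVim or Vim; the function name utf16_code_units is made up for this example. It shows why a single Deseret character yields two 16-bit code units, which would explain a reported "length" of 2.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units (as a list of ints) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp <= 0xFFFF:
        return [cp]                      # BMP: one 16-bit code unit
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


deseret = 0x10400                        # start of the Deseret block
units = utf16_code_units(deseret)
print(["%04X" % u for u in units])       # ['D801', 'DC00']

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(deseret).encode("utf-16-be")
assert encoded == b"".join(u.to_bytes(2, "big") for u in units)
```

Any code point in the BMP comes back as a one-element list, so only supplementary characters change the code-unit count.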

      >
      > The way that Python handles Unicode strings internally depends on how
      > it is configured/built. If configured for "ucs2", Python stores Unicode
      > strings as UTF-16, returns the "length" of strings as the number of
      > 16-bit code units, and if you try to loop through the elements of a
      > string, it loops through 16-bit values, which creates a mess if your
      > string contains supplementary characters. This is comparable to the
      > situation in Java.
      >
      > If you configure Python for "ucs4", then each Unicode string is stored
      > internally as a string of 32-bit code units, "length" is returned as
      > the number of Unicode characters, and if you loop through the
      > characters in a string, you get one Unicode character (code point
      > value) at a time, even for supplementary characters. This "ucs4" option
      > is now formally termed UTF-32 in Unicode circles.
      >
      >
      >
      >> Actually, current standards mandate that no codepoints higher than
      >> U+10FFFD will "ever" be used. (Vim supports up to U+3FFFFFFF, with up
      >> to 6 bytes per codepoint, following an earlier draft of the standard.)
      >>
      >> Unicode also has the notion of "composing characters", which are
      >> characters which are "superimposed" on the preceding character,
      >> possibly changing its shape. These are usually diacritics: most of the
      >> accents of Latin can be either precomposed or spacing-non-accented +
      >> composing-accent, but the optional vowel marks of Hebrew and Arabic
      >> exist only as composing characters.
      >
      > Quite right. "Character" is a technical term in Unicode, and includes
      > spaces, punctuation and these Combining Diacritical Marks (block
      > starting U+0300) that might not fall under the everyday notion of
      > character. An

      also control characters (carriage return, line feed, form feed, horizontal
      tab, soft hyphen, byte-order mark, zero-width joiner, etc.), which also might
      not all fall under the everyday notion of "character".

      > acute-accented é,
      > for example, can be represented in Unicode either as a single character,
      >
      > U+00E9
      >
      > which has the name LATIN SMALL LETTER E WITH ACUTE
      >
      > You can alternatively represent é as a sequence of two Unicode
      > characters
      >
      > U+0065 LATIN SMALL LETTER E
      > U+0301 COMBINING ACUTE ACCENT
      >
      > The Unicode gods have explicitly decreed that these two
      > representations are
      > equivalent, which means that any proper Unicode-capable editor should
      > handle and display them equivalently.
      >
      > In Hopi (spoken in Arizona) orthography (as defined at the University of
      > Arizona), you have some double-accented graphemes like o with both
      > diaeresis and an acute, grave or circumflex accent. In Unicode you
      > can represent o with diaeresis and acute (the acute accent is rendered
      > above the diaeresis) as either the three-character sequence
      >
      > U+006F LATIN SMALL LETTER O
      > U+0308 COMBINING DIAERESIS
      > U+0301 COMBINING ACUTE ACCENT
      >
      > or as the two-character sequence
      >
      > U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
      > U+0301 COMBINING ACUTE ACCENT
      >
      > But there is no single "pre-composed" Unicode character for this
      > purpose.
      >
      > This whole issue of Combining Diacritical Marks is separate from the
      > issue of encoding (UTF-8, UTF-16 or UTF-32). Some conversion between
      > "pre-composed" and "decomposed" representations can be done using
      > "Normalization" routines available in Perl, Python, Java, ICU, etc.

      but not in Vim. AFAIK, the only "normalization" routines afforded by Vim
      (other than not using a separate screen cell for composing characters) are:
      (a) the 'delcombine' option, which, if set, allows <BS> to erase one
      combining character at a time, whereas when it is off (the default) <BS>
      erases one spacing character together with any number of combining
      characters in the same screen cell; and (b) the \Z pattern atom, which
      ignores combining characters anywhere in the text while matching. But AFAIK
      Vim will always treat "é" (U+00E9 LATIN SMALL LETTER E WITH ACUTE) and "é"
      (U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT) as different
      even if it displays them the same.
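
For comparison, here is a small sketch of what the "Normalization" routines Ken mentions do, using Python 3's standard unicodedata module. The helper name canonically_equal is hypothetical, chosen only for this example. It shows the two spellings of é comparing unequal as raw strings but equal after normalization, and the Hopi cluster stopping at two code points under NFC.

```python
import unicodedata


def canonically_equal(a, b):
    """True if two strings are the same after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

print(precomposed == decomposed)                   # False
print(canonically_equal(precomposed, decomposed))  # True

# Hopi o with diaeresis and acute: NFC stops at two code points,
# because no single precomposed character exists.
hopi = unicodedata.normalize("NFC", "o\u0308\u0301")
print(["U+%04X" % ord(ch) for ch in hopi])         # ['U+00F6', 'U+0301']
```

A plain string comparison behaves like Vim here: the two forms are different sequences unless something normalizes them first.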

      >
      > These Combining Diacritical Marks need to be rendered above or below,
      > or attached in particular places, as appropriate, to any letter
      > character. For that to work properly, you need a font (e.g. Doulos SIL
      > or Charis SIL) that contains the diacritic-positioning information, and
      > you need a sophisticated rendering engine (as in XeTeX) that reads and
      > uses that diacritic-positioning information.
      >
      > Most software, including text editors, still does a poor job of handling
      > Combining Diacritical Marks and supplementary characters in general.

      In Arabic, Vim handles combining vowels etc. ("harakaat" as Arabic grammarians
      call them) quite well, including several per character as e.g. in (spacing)
      seen (Arabic S) + combining shadda (geminated-consonant sign) + combining
      fatha (Arabic short vowel a), a combination which appears in the fully
      vocalized form of "as-salaam" (Peace). Starting recently (7.1.116), Vim can
      now display (not only edit) any codepoint in the current 'guifont', not only
      those in the BMP. From what you say above, it looks like Vim is ahead of "most
      software including text editors", but I don't doubt that the situation will
      get better as time goes on.
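
To make the "several per character" case concrete, the following Python 3 sketch groups each spacing character with the combining marks that follow it. It only approximates the idea of one screen cell per cluster and is not how Vim implements it; the function name split_cells is invented for this example, and it relies on the canonical combining class being nonzero, which holds for the Arabic marks used here but not for every combining character in Unicode.

```python
import unicodedata


def split_cells(text):
    """Group each base character with the combining marks that follow it."""
    cells = []
    for ch in text:
        if cells and unicodedata.combining(ch):
            cells[-1] += ch          # nonzero combining class: same cell
        else:
            cells.append(ch)         # spacing character: new cell
    return cells


# seen + shadda + fatha, as in the fully vocalized "as-salaam"
cluster = "\u0633\u0651\u064e"
cells = split_cells(cluster)
print(len(cluster), len(cells))      # 3 1
for ch in cluster:
    print("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.combining(ch))
```

Three code points end up in a single group, which matches the seen + shadda + fatha combination described above.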

      >
      >> Since your Deseret characters are outside the BMP, each of them
      >> requires 4 bytes in UTF-8 (also two 16-bit words in UTF-16 and one
      >> 32-bit doubleword in UTF-32); but maybe that's not what your measured
      >> "length" means? Does your NSString include a final null (as C strings
      >> do) or an initial bytecount (as Pascal strings do)? Or do your Deseret
      >> characters include "composing" elements?
      >
      > Because the "length" of each Deseret Character is being returned as 2
      > rather than 1, it sounds like the MacVim code is using a Java-like
      > UTF-16 internal representation for storing Unicode characters
      > (including supplementary characters).

      How do you compute that length? The strlen() function should return 4 for each
      Deseret character, and the function (similar to that mentioned under ":help
      strlen()")

      strlen(substitute(string, '.', '-', 'g'))

      should return 1.
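
The three candidate "lengths" in this thread can be put side by side. The sketch below assumes Python 3.3 or later, where a str is a sequence of code points regardless of how the interpreter was built; the function name lengths is made up for this example. The byte count corresponds to what Vim's strlen() reports, the code point count roughly to the substitute() trick, and the UTF-16 count to the value 2 seen from NSString.

```python
def lengths(text):
    """Count one string three ways: code points, UTF-8 bytes, UTF-16 units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


one_deseret = chr(0x10400)
print(lengths(one_deseret))
# {'code_points': 1, 'utf8_bytes': 4, 'utf16_units': 2}
```

For plain ASCII text all three numbers agree, which is why the difference tends to go unnoticed until supplementary characters appear.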

      >
      > There are no Combining Diacritical Marks required in the traditional
      > Deseret Alphabet, per se, although proper rendering software _should_
      > allow you to associate one or more Combining Diacritical Marks with any
      > letter character and have it rendered acceptably. (Handling combining
      > diacritical marks with Deseret Alphabet is very low priority.)
      >
      > Each Deseret Alphabet letter is a single Unicode character, with a
      > single code point value in the supplementary area (block starting
      > U+10400). The Shavian alphabet is much the same (in the block starting
      > U+10450). The glyphs are straightforward, rendered left-to-right,
      > requiring no ligatures, and could be forced into a fixed-pitch (mono)
      > font about as easily as Roman glyphs.
      >
      > Ken

      and a lot more easily than Arabic, where a single letter (with a single code
      point) may have to be shown in up to 4 different ways (not counting combining
      characters), depending on its position in the word and on which letter (if
      any) precedes it. Happily Vim (with +arabic) knows how to fetch the required
      "presentation forms" from the Arabic fonts. Anyway, the beautiful cursive
      shapes of Arabic still look ugly when rendered in any monospace font, but
      that's because Arabic calligraphy, with its long flourishes at the end of
      almost every word, was invented for the calame (i.e., the reed pen), not the
      typewriter.
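
As an illustration of "one code point, up to 4 shapes", the sketch below takes the four positional glyphs of the letter seen from the Arabic Presentation Forms-B block and maps each back to the single base letter with NFKC normalization. It runs in the opposite direction from what a shaping engine does and says nothing about how Vim's +arabic feature is implemented; the names SEEN_FORMS and base_letter are made up for this example.

```python
import unicodedata

SEEN = "\u0633"  # ARABIC LETTER SEEN, the one code point stored in the text

# The four positional glyphs from Arabic Presentation Forms-B.
SEEN_FORMS = {
    "isolated": "\ufeb1",
    "final": "\ufeb2",
    "initial": "\ufeb3",
    "medial": "\ufeb4",
}


def base_letter(presentation_form):
    """Map a presentation form back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", presentation_form)


for position, glyph in SEEN_FORMS.items():
    print(position, "U+%04X" % ord(glyph), base_letter(glyph) == SEEN)
```

Each of the four glyphs normalizes to the same stored letter, so the choice of shape is purely a display matter.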


      Best regards,
      Tony.
      --
      Try to be the best of whatever you are, even if what you are is no
      good.


      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---