Re: Vim on OS X, (no)macatsui problem

  • Nico Weber
    Message 1 of 17 , Oct 8, 2007
      > Ugh. I tried sifting through the forwarded posts, but it was kind of
      > hard to understand them. I will try to read the posts on google
      > groups instead, unless somebody can summarize the problem(s) for me?

      Would have been easier if you'd "Reply all"d. Here's what I think
      Kenneth's problems are:

      > I just installed the latest MacVim and tried it with a version of
      > DejaVuSansMono.ttf, augmented with (monowidth) glyphs, the same width
      > as the original DejaVuSansMono.ttf glyphs, for the Deseret Alphabet
      > block (U+10400). It doesn't seem to work for me. When I select my
      > Deseret Alphabet keymap and try to type Deseret Alphabet, I see
      > pseudo glyphs in boxes rendered on the screen.

      You can enter Deseret characters by opening the Character Palette,
      putting "deseret" in the search box at the bottom and ... well, you
      know the rest. MacVim displays a "character not found" sign which is
      probably the Right Thing as the default DejaVu font seems not to
      include these characters, but Kenneth uses a font that _does_ have
      them. Having access to Kenneth's font would help...

      He also reports that mapping numbers `:map 3 ...` doesn't work. I
      can't reproduce this.

      Nico

      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---
    • Nico Weber
      Message 2 of 17 , Oct 8, 2007
        > He also reports that mapping numbers `:map 3 ...` doesn't work. I
        > can't reproduce this.

        I got this one wrong. See the other thread for Kenneth's
        clarification. Sorry.

        Nico


      • björn
        Message 3 of 17 , Oct 13, 2007
          > > He also reports that mapping numbers `:map 3 ...` doesn't work. I
          > > can't reproduce this.
          >
          > I got this one wrong. See the other thread for Kenneth's
          > clarification. Sorry.

          Hi Ken,

          I have looked into why MacVim fails to render the deseret glyphs and I
          now have an answer, but unfortunately no solution.

          The problem is that one deseret character for some reason takes up
          _two_ characters when put in the text storage (I guess this has
          something to do with Unicode?). Specifically, calling "length" on an
          NSString containing one deseret character returns 2 instead of 1, as I
          would expect.

          Now, I do know how to fix this problem, but since Jiang is working on
          moving his drawing code to MacVim I don't really want to spend any
          time doing this, since the problem will disappear as soon as he is
          finished. I'm sorry about that.


          /Björn


        • Tony Mechelynck
          Message 4 of 17 , Oct 13, 2007
            björn wrote:
            >>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
            >>> can't reproduce this.
            >> I got this one wrong. See the other thread for Kenneth's
            >> clarification. Sorry.
            >
            > Hi Ken,
            >
            > I have looked into why MacVim fails to render the deseret glyphs and I
            > now have an answer, but unfortunately no solution.
            >
            > The problem is that one deseret character for some reason takes up
            > _two_ characters when put in the text storage (I guess this has
            > something to do with Unicode?). Specifically, calling "length" on an
            > NSString containing one deseret character returns 2 instead of 1, as I
            > would expect.
            >
            > Now, I do know how to fix this problem, but since Jiang is working on
            > moving his drawing code to MacVim I don't really want to spend any
            > time doing this, since the problem will disappear as soon as he is
            > finished. I'm sorry about that.
            >
            >
            > /Björn

            UTF-8 uses:
            1 byte for each codepoint in the range U+0000 - U+007F
            2 bytes for each codepoint in the range U+0080 - U+07FF
            3 bytes for each codepoint in the range U+0800 - U+FFFF
            4 bytes for each codepoint in the range U+10000 - U+1FFFFF
            Actually, current standards mandate that no codepoints higher than U+10FFFD
            will "ever" be used. (Vim supports up to U+3FFFFFFF, with up to 6 bytes per
            codepoint, following an earlier draft of the standard.)

            Unicode also has the notion of "composing characters", which are characters
            which are "superimposed" on the preceding character, possibly changing its
            shape. These are usually diacritics: most of the accents of Latin can be
            either precomposed or spacing-non-accented + composing-accent, but the
            optional vowel marks of Hebrew and Arabic exist only as composing characters.

            Since your Deseret characters are outside the BMP, each of them requires 4
            bytes in UTF-8 (also two 16-bit words in UTF-16 and one 32-bit doubleword in
            UTF-32); but maybe that's not what your measured "length" means? Does your
            NSString include a final null (as C strings do) or an initial bytecount (as
            Pascal strings do)? Or do your Deseret characters include "composing" elements?
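
The sizes listed above can be checked directly. What follows is only a minimal sketch, assuming Python 3; the helper name `encoded_sizes` is made up for illustration. It encodes a single code point and counts UTF-8 bytes, UTF-16 16-bit units, and UTF-32 32-bit units.

```python
def encoded_sizes(code_point):
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for one code point."""
    ch = chr(code_point)
    utf8_bytes = len(ch.encode("utf-8"))
    # The "-le" codecs emit no byte order mark, so the byte counts divide cleanly.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    utf32_units = len(ch.encode("utf-32-le")) // 4
    return utf8_bytes, utf16_units, utf32_units


# Boundaries of the UTF-8 ranges quoted above, plus U+10400 (start of the Deseret block).
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10400):
    print(f"U+{cp:04X}: {encoded_sizes(cp)}")
```

For U+10400 the UTF-16 count is 2, which is the same number the NSString "length" call reported.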


            Best regards,
            Tony.
            --
            hundred-and-one symptoms of being an internet addict:
            55. You ask your doctor to implant a gig in your brain.

          • björn
            Message 5 of 17 , Oct 14, 2007
              > > The problem is that one deseret character for some reason takes up
              > > _two_ characters when put in the text storage (I guess this has
              > > something to do with Unicode?). Specifically, calling "length" on an
              > > NSString containing one deseret character returns 2 instead of 1, as I
              > > would expect.
              > >
              > UTF-8 uses:
              > 1 byte for each codepoint in the range U+0000 - U+007F
              > 2 bytes for each codepoint in the range U+0080 - U+07FF
              > 3 bytes for each codepoint in the range U+0800 - U+FFFF
              > 4 bytes for each codepoint in the range U+10000 - U+1FFFFF
              > Actually, current standards mandate that no codepoints higher than U+10FFFD
              > will "ever" be used. (Vim supports up to U+3FFFFFFF, with up to 6 bytes per
              > codepoint, following an earlier draft of the standard.)
              >
              > Unicode also has the notion of "composing characters", which are characters
              > which are "superimposed" on the preceding character, possibly changing its
              > shape. These are usually diacritics: most of the accents of Latin can be
              > either precomposed or spacing-non-accented + composing-accent, but the
              > optional vowel marks of Hebrew and Arabic exist only as composing characters.
              >
              > Since your Deseret characters are outside the BMP, each of them requires 4
              > bytes in UTF-8 (also two 16-bit words in UTF-16 and one 32-bit doubleword in
              > UTF-32); but maybe that's not what your measured "length" means? Does your
              > NSString include a final null (as C strings do) or an initial bytecount (as
              > Pascal strings do)? Or do your Deseret characters include "composing" elements?

              I'm sorry about the confusion with posting this thread separately on
              vim_multibyte and vim_mac...I'll try to bring the diverging threads
              together by posting this reply to both groups.

              Tim Allen replied to the vim_mac thread saying that NSString uses
              utf-16 internally and this is indeed why it says one deseret char has
              length 2 (since it needs two 16 bit chars to store one deseret char,
              as has been pointed out already).

              I was under the mistaken impression that NSString always returned
              length 1 for one character (not counting composing characters), which
              is why I thought MacVim would work in all situations except when
              composing characters were used. Again, this can be fixed by getting
              rid of the assumption that each line in the text storage has the same
              length (as returned by NSString), but this is a rather big code
              change.
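
To illustrate why equal-length lines cannot be assumed, here is a rough sketch in Python 3.3 or later. It is not MacVim's code; `utf16_length` and `utf16_offset` are hypothetical helpers. The first counts 16-bit units the way NSString's "length" does, the second maps a character column to its UTF-16 offset.

```python
def utf16_length(text):
    """Number of 16-bit UTF-16 code units needed to store text."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def utf16_offset(text, column):
    """UTF-16 offset of the character at a 0-based column counted in code points."""
    return utf16_length(text[:column])


ascii_line = "abc"
deseret_line = "a\U00010400c"  # same number of characters, one outside the BMP

print(len(ascii_line), utf16_length(ascii_line))      # 3 3
print(len(deseret_line), utf16_length(deseret_line))  # 3 4
print(utf16_offset(deseret_line, 2))                  # 3
```

Two lines with the same number of visible characters can therefore have different UTF-16 lengths, so any column-to-offset mapping has to be computed per line.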

              Thanks to Tony and Tim for educating me on the finer points of Unicode... :-)


              /Björn

            • Tony Mechelynck
              Message 6 of 17 , Oct 14, 2007
                björn wrote:
                [...]
                > I'm sorry about the confusion with posting this thread separately on
                > vim_multibyte and vim_mac...I'll try to bring the diverging threads
                > together by posting this reply to both groups.
                >
                > Tim Allen replied to the vim_mac thread saying that NSString uses
                > utf-16 internally and this is indeed why it says one deseret char has
                > length 2 (since it needs two 16 bit chars to store one deseret char,
                > as has been pointed out already).

                Yes, obviously (if one thinks about it) one UTF-16 16-bit word cannot
                represent anything above U+FFFF. For codepoints U+10000 to U+10FFFF
                (including Deseret, among others), two "surrogate characters" are used:
                two 16-bit words, the first in the range 0xD800-0xDBFF and the second in
                the range 0xDC00-0xDFFF. See
                http://en.wikipedia.org/wiki/UTF-16#Encoding_of_characters_outside_the_BMP
                for details. Unlike UTF-8 and UTF-32, UTF-16 inherently cannot, even
                with surrogates, represent anything above U+10FFFF, and I suppose that
                is one of the reasons why it was decided to bring the "upper range" of
                Unicode down from U+7FFFFFFF to U+10FFFF (and even U+10FFFD, since for
                other reasons the last two codepoints of every plane, U+xxFFFE and
                U+xxFFFF, are "invalid").
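
As a concrete illustration of the surrogate ranges, the split can be computed by hand. This is a small sketch, assuming Python 3; the function name `to_surrogate_pair` is invented for the example.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits, lands in 0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits, lands in 0xDC00-0xDFFF
    return high, low


high, low = to_surrogate_pair(0x10400)
print(f"{high:04X} {low:04X}")  # D801 DC00
```

The 20 bits available in a pair (10 + 10) are exactly what limits UTF-16 to U+10FFFF.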

                >
                > I was under the mistaken impression that NSString always returned
                > length 1 for one character (not counting composing characters), which
                > is why I thought MacVim would work in all situations except when
                > composing characters were used. Again, this can be fixed by getting
                > rid of the assumption that each line in the text storage has the same
                > length (as returned by NSString), but this is a rather big code
                > change.
                >
                > Thanks to Tony and Tim for educating me on the finer points of Unicode... :-)

                My pleasure. :-)

                >
                >
                > /Björn

                Best regards,
                Tony.
                --
                Court, n.:
                A place where they dispense with justice.
                -- Arthur Train

              • Kenneth Beesley
                Message 7 of 17 , Oct 15, 2007
                  Hi Björn,

                  Many thanks for the message.

                  Yeah, the term Character is a technical term in Unicode, and each
                  Unicode character has a code point value that ranges from 0x0 to
                  0x10FFFF.

                  In the original vision of Unicode, code point values ranged from 0x0
                  to 0xFFFF, allowing just 64k distinct characters. This old limited
                  range is now known as the Basic Multilingual Plane (BMP). The current
                  vision of Unicode, now 10 years old, allows about a million characters,
                  and the characters with code point values beyond 0xFFFF are known
                  as supplementary characters.

                  Many software applications still haven't caught up with supplementary
                  characters. They're still stuck in the BMP.

                  In Java, there is a type called "char" that has 16 bits and so can
                  represent any code point value in the BMP, 0x0 to 0xFFFF. It is
                  important not to confuse "char" with the Unicode notion of Character.
                  In Java, to store a supplementary Unicode character, two "chars" are
                  used, in a coding system known as UTF-16. It sounds like MacVim has a
                  similar storage system, and that the length-in-chars is being confused
                  with the length-in-Unicode-characters.

                  Best wishes,

                  Ken



                  On 13 Oct 2007, at 12:45, björn wrote:

                  >
                  >>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
                  >>> can't reproduce this.
                  >>
                  >> I got this one wrong. See the other thread for Kenneth's
                  >> clarification. Sorry.
                  >
                  > Hi Ken,
                  >
                  > I have looked into why MacVim fails to render the deseret glyphs and I
                  > now have an answer, but unfortunately no solution.
                  >
                  > The problem is that one deseret character for some reason takes up
                  > _two_ characters when put in the text storage (I guess this has
                  > something to do with Unicode?). Specifically, calling "length" on an
                  > NSString containing one deseret character returns 2 instead of 1, as I
                  > would expect.
                  >
                  > Now, I do know how to fix this problem, but since Jiang is working on
                  > moving his drawing code to MacVim I don't really want to spend any
                  > time doing this, since the problem will disappear as soon as he is
                  > finished. I'm sorry about that.
                  >
                  >
                  > /Björn
                  >
                  >
                  > >


                • Kenneth Beesley
                  Message 8 of 17 , Oct 15, 2007
                    Tony,

                    Great message, as usual.
                    I insert some friendly comments below.

                    On 13 Oct 2007, at 18:30, Tony Mechelynck wrote:

                    >
                    > björn wrote:
                    >>>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
                    >>>> can't reproduce this.
                    >>> I got this one wrong. See the other thread for Kenneth's
                    >>> clarification. Sorry.
                    >>
                    >> Hi Ken,
                    >>
                    >> I have looked into why MacVim fails to render the deseret glyphs and I
                    >> now have an answer, but unfortunately no solution.
                    >>
                    >> The problem is that one deseret character for some reason takes up
                    >> _two_ characters when put in the text storage (I guess this has
                    >> something to do with Unicode?). Specifically, calling "length" on an
                    >> NSString containing one deseret character returns 2 instead of 1, as I
                    >> would expect.
                    >>
                    >> Now, I do know how to fix this problem, but since Jiang is working on
                    >> moving his drawing code to MacVim I don't really want to spend any
                    >> time doing this, since the problem will disappear as soon as he is
                    >> finished. I'm sorry about that.
                    >>
                    >>
                    >> /Björn
                    >

                    Tony responds:
                    > UTF-8 uses:
                    > 1 byte for each codepoint in the range U+0000 - U+007F
                    > 2 bytes for each codepoint in the range U+0080 - U+07FF
                    > 3 bytes for each codepoint in the range U+0800 - U+FFFF
                    > 4 bytes for each codepoint in the range U+10000 - U+1FFFFF

                    KRB: The current modern Unicode character set has code point
                    values ranging from U+0 to U+10FFFF, allowing about a million
                    distinct "characters". These Unicode Characters are slightly abstract
                    and need to be distinguished carefully from how they are "encoded"
                    in a file or in a programming language. In UTF-8 encoding, the "code
                    unit" is one byte, and each Unicode character (each code point value)
                    is stored in one to four bytes as you describe above. The conversion
                    between code point values (integers) and the bit/byte representations
                    requires some trivial bit extraction and shifting.
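
The bit extraction and shifting for UTF-8 can be sketched as follows. This is a hand-rolled illustration assuming Python 3, not how any particular editor or runtime implements it; `utf8_encode` is a name chosen for the example.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 by hand."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x10400).hex())  # f0909080
```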

                    What Björn describes sounds more like UTF-16, where each Unicode
                    character (code point value) is stored in either one 16-bit "code unit"
                    or in two 16-bit code units. Characters from the Basic Multilingual
                    Plane, U+0 to U+FFFF, are stored in a single 16-bit code unit.
                    Supplementary characters, those beyond the Basic Multilingual Plane,
                    are stored in two 16-bit code units. (Again there is some trivial bit
                    manipulation involved in conversion between code point values and the
                    bit representations in the 16-bit code units.)
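
Going from 16-bit code units back to code point values looks roughly like this. It is a sketch assuming Python 3, with hypothetical helper names; a high surrogate followed by anything other than a low surrogate is simply rejected here.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate into one supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code point values."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(units):
            # Two code units make up one supplementary character.
            code_points.append(from_surrogate_pair(unit, units[i + 1]))
            i += 2
        else:
            # A BMP character occupies a single code unit.
            code_points.append(unit)
            i += 1
    return code_points


units = [0x0061, 0xD801, 0xDC00, 0x0063]   # "a", U+10400, "c"
print(len(units), [hex(cp) for cp in decode_utf16_units(units)])
```

Four code units decode to three code points, which is the difference between length-in-units and length-in-characters discussed in this thread.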

                    Perl stores Unicode strings internally as UTF-8, but you should
                    hardly ever have to know that. If you ask for the length of a Perl
                    Unicode string, Perl gives you the length in Unicode Characters. If
                    you loop through the characters in a Perl Unicode string, it loops
                    through Unicode Characters, taking care of the underlying UTF-8
                    encoding in the background. The underlying encoding in UTF-8 is
                    effectively hidden from the programmer. At the programming level,
                    you can always think of a Perl Unicode string as a sequence of Unicode
                    Characters (including supplementary characters).

                    Java from the very beginning took Unicode very seriously. But Java
                    emerged in the olden days of Unicode, when code point values ranged
                    only from U+0 to U+FFFF, so every original Unicode character could be
                    stored in a single 16-bit "char". The length of a Unicode string was
                    simply the number of chars. Easy and clean.

                    The introduction of supplementary Unicode characters 10 years ago
                    created quite a challenge for Java and other programming languages
                    that wanted to take Unicode seriously. Instead of accommodating the
                    New Unicode by making char 32 bits (which would allow each New Unicode
                    character to be stored straightforwardly in a single 32-bit char) the
                    Java gurus opted to keep "char" at 16 bits and use UTF-16 to store
                    Unicode strings. If you ask for the "length" of a Unicode string in
                    Java, it still returns the length in chars rather than the length in
                    Unicode Characters. This is (arguably) quite a mess, and you have to
                    be very aware of it as a programmer if you want to handle
                    Supplementary Unicode Characters.

                    The way that Python handles Unicode strings internally depends on how
                    it is configured/built. If configured for "ucs2", Python stores
                    Unicode strings as UTF-16, returns the "length" of strings as the
                    number of 16-bit code units, and if you try to loop through the
                    elements of a string, it loops through 16-bit values, which creates
                    a mess if your string contains supplementary characters. This is
                    comparable to the situation in Java.

                    If you configure Python for "ucs4", then each Unicode string is
                    stored internally as a string of 32-bit code units, "length" is
                    returned as the number of Unicode characters, and if you loop through
                    the characters in a string, you get one Unicode character (code point
                    value) at a time, even for supplementary characters. This "ucs4"
                    option is now formally termed UTF-32 in Unicode circles.
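
For readers on a current interpreter: since Python 3.3 there is no separate ucs2/ucs4 build, and `len` and iteration always work on code points. A quick check, assuming Python 3.3 or later:

```python
import sys

deseret = "\U00010400"  # one supplementary character from the Deseret block

print(hex(sys.maxunicode))                 # 0x10ffff
print(len(deseret))                        # 1
print([hex(ord(ch)) for ch in deseret])    # ['0x10400']
```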



                    > Actually, current standards mandate that no codepoints higher than
                    > U+10FFFD will "ever" be used. (Vim supports up to U+3FFFFFFF, with up
                    > to 6 bytes per codepoint, following an earlier draft of the standard.)
                    >
                    > Unicode also has the notion of "composing characters", which are
                    > characters which are "superimposed" on the preceding character,
                    > possibly changing its shape. These are usually diacritics: most of the
                    > accents of Latin can be either precomposed or spacing-non-accented +
                    > composing-accent, but the optional vowel marks of Hebrew and Arabic
                    > exist only as composing characters.

                    Quite right. "Character" is a technical term in Unicode, and includes
                    spaces, punctuation and these Combining Diacritical Marks (block
                    starting U+0300) that might not fall under the everyday notion of
                    character. An acute-accented é, for example, can be represented in
                    Unicode either as a single character,

                    U+00E9

                    which has the name LATIN SMALL LETTER E WITH ACUTE

                    You can alternatively represent é as a sequence of two Unicode
                    characters

                    U+0065 LATIN SMALL LETTER E
                    U+0301 COMBINING ACUTE ACCENT

                    The Unicode gods have explicitly decreed that these two
                    representations are equivalent, which means that any proper
                    Unicode-capable editor should handle and display them equivalently.

                    In Hopi (spoken in Arizona) orthography (as defined at the University of
                    Arizona), you have some double-accented graphemes like o with both
                    diaeresis and an acute, grave or circumflex accent. In Unicode you
                    can represent o with diaeresis and acute (the acute accent is rendered
                    above the diaeresis) as either the three-character sequence

                    U+006F LATIN SMALL LETTER O
                    U+0308 COMBINING DIAERESIS
                    U+0301 COMBINING ACUTE ACCENT

                    or as the two-character sequence

                    U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
                    U+0301 COMBINING ACUTE ACCENT

                    But there is no single "pre-composed" Unicode character for this purpose.

                    This whole issue of Combining Diacritical Marks is separate from the
                    issue of encoding (UTF-8, UTF-16 or UTF-32). Some conversion between
                    "pre-composed" and "decomposed" representations can be done using
                    "Normalization" routines available in Perl, Python, Java, ICU, etc.
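
As an example of such a normalization routine, Python's standard `unicodedata` module can convert between the two forms. A short sketch, assuming Python 3, using the é and Hopi sequences given above:

```python
import unicodedata

precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# NFC composes where a precomposed character exists; NFD decomposes fully.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

# Hopi o with diaeresis and acute: both spellings normalize to the same string,
# but NFC still needs two code points because no single precomposed form exists.
hopi_three = "o\u0308\u0301"
hopi_two = "\u00f6\u0301"
nfc = unicodedata.normalize("NFC", hopi_three)
print(nfc == unicodedata.normalize("NFC", hopi_two))   # True
print(len(nfc))                                        # 2
```

Comparing strings only after normalizing both sides to the same form is the usual way to treat equivalent sequences as equal.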

                    These Combining Diacritical Marks need to be rendered above or below,
                    or attached in particular places, as appropriate, to any letter
                    character. For that to work properly, you need a font (e.g. Doulos SIL
                    or Charis SIL) that contains the diacritic-positioning information,
                    and you need a sophisticated rendering engine (as in XeTeX) that reads
                    and uses that diacritic-positioning information.

                    Most software, including text editors, still does a poor job of handling
                    Combining Diacritical Marks and supplementary characters in general.

                    >
                    > Since your Deseret characters are outside the BMP, each of them
                    > requires 4 bytes in UTF-8 (also two 16-bit words in UTF-16 and one
                    > 32-bit doubleword in UTF-32); but maybe that's not what your measured
                    > "length" means? Does your NSString include a final null (as C strings
                    > do) or an initial bytecount (as Pascal strings do)? Or do your Deseret
                    > characters include "composing" elements?

                    Because the "length" of each Deseret Character is being returned as 2
                    rather than 1, it sounds like the MacVim code is using a Java-like
                    UTF-16 internal representation for storing Unicode characters
                    (including supplementary characters).

                    There are no Combining Diacritical Marks required in the traditional
                    Deseret Alphabet, per se, although proper rendering software _should_
                    allow you to associate one or more Combining Diacritical Marks with
                    any letter character and have it rendered acceptably. (Handling
                    combining diacritical marks with Deseret Alphabet is very low priority.)

                    Each Deseret Alphabet letter is a single Unicode character, with a
                    single code point value in the supplementary area (block starting
                    U+10400). The Shavian alphabet is much the same (in the block starting
                    U+10450). The glyphs are straightforward, rendered left-to-right,
                    requiring no ligatures, and could be forced into a fixed-pitch (mono)
                    font about as easily as Roman glyphs.
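
The two block starts mentioned here can be inspected with Python's `unicodedata` module. A minimal sketch, assuming Python 3 and a Unicode database recent enough to include both scripts:

```python
import unicodedata

# First code point of the Deseret block and of the Shavian block.
for cp in (0x10400, 0x10450):
    ch = chr(cp)
    in_bmp = cp <= 0xFFFF
    print(f"U+{cp:05X} {unicodedata.name(ch)}: characters={len(ch)}, in BMP={in_bmp}")
```

Each letter is reported as one character outside the BMP, with no combining marks involved.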

                    Ken



                    >
                    >
                    > Best regards,
                    > Tony.
                    > --
                    > hundred-and-one symptoms of being an internet addict:
                    > 55. You ask your doctor to implant a gig in your brain.
                    >
                    > >


                  • Tony Mechelynck
                    ... Vim doesn t use UTF-16 internally, because the many intervening nulls would wreak havoc with the C requirement of null-terminated strings. If you set
                    Message 9 of 17 , Oct 16, 2007
                      Kenneth Beesley wrote:
                      > Hi Björn,
                      >
                      > Many thanks for the message.
                      >
                      > Yeah, the term Character is a technical term in Unicode, and each
                      > Unicode character has a code point value that ranges from 0x0 to
                      > 0x10FFFF.
                      >
                      > In the original vision of Unicode, code point values ranged from 0x0
                      > to 0xFFFF, allowing just 64k distinct characters. This old limited
                      > range
                      > is now known as the Basic Multilingual Plane (BMP). The current
                      > vision of Unicode, now 10 years old, allows about a million characters,
                      > and the characters with code point values beyond 0xFFFF are known
                      > as supplementary characters.
                      >
                      > Many software applications still haven't caught up with supplementary
                      > characters. They're still stuck in the BMP.
                      >
                      > In Java, there is a type called "char" that has 16 bits and so can
                      > represent any code point value in the BMP, 0x0 to 0xFFFF. It is
                      > important
                      > not to confuse "char" with the Unicode notion of Character. In Java,
                      > to store a supplementary Unicode character, two "chars" are used, in a
                      > coding system known as UTF-16. It sounds like MacVim has a similar
                      > storage system, and that the length-in-chars is being confused with
                      > the length-in-Unicode-characters.
                      >
                      > Best wishes,
                      >
                      > Ken

                      Vim doesn't use UTF-16 internally, because the many intervening nulls would
                      wreak havoc with the C requirement of null-terminated strings. If you set
                      'encoding' to UCS-4, UTF-16 or UTF-32 (of any endianness), Vim will actually
                      use UTF-8 internally, because 0x00 in UTF-8 is the NUL character (codepoint
                      U+0000), nothing else, and Vim already knows how to handle that.

                      When you set 'fileencoding' to UTF-16, the internal UTF-8 representation of
                      the text will be converted to and from UTF-16 when writing or reading
                      (respectively), using surrogate pairs for any codepoint above U+FFFF, so that,
                      _on disk_, they take two UTF-16 words rather than one.
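To make the surrogate-pair step concrete, here is a minimal Python sketch. Python is used purely for illustration; this is not how Vim implements the conversion. The helper name `to_surrogates` is made up for this example, and U+10400 is the first code point of the Deseret block discussed in this thread.

```python
def to_surrogates(cp):
    """Split a supplementary code point (above U+FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x10400)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x10400).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Each of the two values is one 16-bit word, which is why a single Deseret letter occupies two UTF-16 words in the file.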

                      I don't know what function you used to count characters, but the Vim
                      string-length function, strlen(), gives a string's length in _bytes_ in the
                      current internal representation: for Unicode, "a" (U+0061) is one, "é"
                      (e-acute, U+00E9) is two, "†" (dagger, U+2020) is three and any Deseret
                      character is four. (Under ":help strlen()" you can see how to count
                      "characters" in a string, as opposed to "bytes".)
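The byte counts above can be checked outside Vim as well. The following is a small Python sketch, not Vim script; `utf8_len` is a hypothetical helper name, and it assumes Python 3, where `len()` on a string counts code points.

```python
def utf8_len(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


samples = {
    "a (U+0061)": "\u0061",
    "e-acute (U+00E9)": "\u00e9",
    "dagger (U+2020)": "\u2020",
    "Deseret (U+10400)": "\U00010400",
}

for name, ch in samples.items():
    # len(ch) counts code points; utf8_len(ch) counts bytes.
    print(name, "code points:", len(ch), "UTF-8 bytes:", utf8_len(ch))
```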

                      >
                      >
                      >
                      > On 13 Oct 2007, at 12:45, björn wrote:
                      >
                      >>>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
                      >>>> can't reproduce this.
                      >>> I got this one wrong. See the other thread for Kenneth's
                      >>> clarification. Sorry.
                      >> Hi Ken,
                      >>
                      >> I have looked into why MacVim fails to render the deseret glyphs and I
                      >> now have an answer, but unfortunately no solution.
                      >>
                      >> The problem is that one deseret character for some reason takes up
                      >> _two_ characters when put in the text storage (I guess this has
                      >> something to do with Unicode?). Specifically, calling "length" on an
                      >> NSString containing one deseret character returns 2 instead of 1, as I
                      >> would expect.
                      >>
                      >> Now, I do know how to fix this problem, but since Jiang is working on
                      >> moving his drawing code to MacVim I don't really want to spend any
                      >> time doing this, since the problem will disappear as soon as he is
                      >> finished. I'm sorry about that.
                      >>
                      >>
                      >> /Björn


                      Best regards,
                      Tony.
                      --
                      During a grouse hunt in North Carolina two intrepid sportsmen
                      were blasting away at a clump of trees near a stone wall. Suddenly a
                      red-faced country squire popped his head over the wall and shouted,
                      "Hey, you almost hit my wife."
                      "Did I?" cried the hunter, aghast. "Terribly sorry. Have a
                      shot at mine, over there."

                    • Tony Mechelynck
                      Message 10 of 17 , Oct 16, 2007
                        Kenneth Beesley wrote:
                        >
                        > Tony,
                        >
                        > Great message, as usual.
                        > I insert some friendly comments below.
                        >
                        > On 13 Oct 2007, at 18:30, Tony Mechelynck wrote:
                        >
                        >> björn wrote:
                        >>>>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
                        >>>>> can't reproduce this.
                        >>>> I got this one wrong. See the other thread for Kenneth's
                        >>>> clarification. Sorry.
                        >>> Hi Ken,
                        >>>
                        >>> I have looked into why MacVim fails to render the deseret glyphs
                        >>> and I
                        >>> now have an answer, but unfortunately no solution.
                        >>>
                        >>> The problem is that one deseret character for some reason takes up
                        >>> _two_ characters when put in the text storage (I guess this has
                        >>> something to do with Unicode?). Specifically, calling "length" on an
                        >>> NSString containing one deseret character returns 2 instead of 1,
                        >>> as I
                        >>> would expect.
                        >>>
                        >>> Now, I do know how to fix this problem, but since Jiang is working on
                        >>> moving his drawing code to MacVim I don't really want to spend any
                        >>> time doing this, since the problem will disappear as soon as he is
                        >>> finished. I'm sorry about that.
                        >>>
                        >>>
                        >>> /Björn
                        >
                        > Tony responds:
                        >> UTF-8 uses:
                        >> 1 byte for each codepoint in the range U+0000 - U+007F
                        >> 2 bytes for each codepoint in the range U+0080 - U+07FF
                        >> 3 bytes for each codepoint in the range U+0800 - U+FFFF
                        >> 4 bytes for each codepoint in the range U+10000 - U+1FFFFF
                        >
                        > KRB: The current modern Unicode character set has code point
                        > values ranging from U+0 to U+10FFFF, allowing about a million
                        > distinct "characters". These Unicode Characters are slightly abstract
                        > and need to be distinguished carefully from how they are "encoded"
                        > in a file or in a programming language. In UTF-8 encoding, the "code
                        > unit"
                        > is one byte, and each Unicode character (each code point value) is
                        > stored in one
                        > to four bytes as you describe above. The conversion between code point
                        > values (integers) and the bit/byte representations requires some trivial
                        > bit extraction and shifting.
                        >
                        > What Björn describes sounds more like UTF-16, where each Unicode
                        > character (code point value) is stored in either one 16-bit "code unit"
                        > or in two 16-bit code units. Characters from the Basic Multilingual
                        > Plane,
                        > U+0 to U+FFFF, are stored in a single 16-bit code unit. Supplementary
                        > characters, those beyond the Basic Multilingual Plane, are stored in two
                        > 16-bit code units. (Again there is some trivial bit manipulation
                        > involved
                        > in conversion between code point values and the bit representations
                        > in the 16-bit code units.)
                        >
                        > Perl stores Unicode strings internally as UTF-8, but you should
                        > hardly ever
                        > have to know that. If you ask for the length of a Perl Unicode
                        > string, Perl gives
                        > you the length in Unicode Characters. If you loop through the
                        > characters in
                        > a Perl Unicode string, it loops through Unicode Characters, taking
                        > care of
                        > the underlying UTF-8 encoding in the background. The underlying
                        > encoding
                        > in UTF-8 is effectively hidden from the programmer. At the
                        > programming level,
                        > you can always think of a Perl Unicode string as a sequence of Unicode
                        > Characters (including supplementary characters).
                        >
                        > Java from the very beginning took Unicode very seriously. But Java
                        > emerged
                        > in the olden days of Unicode, when code point values ranged only from
                        > U+0 to
                        > U+FFFF, so every original Unicode character could be stored in a single
                        > 16-bit "char". The length of a Unicode string was simply the number
                        > of chars.
                        > Easy and clean.
                        > The introduction of supplementary Unicode characters 10 years ago
                        > created
                        > quite a challenge for Java and other programming languages that wanted
                        > to take Unicode seriously. Instead of accommodating the New Unicode by
                        > making char 32 bits (which would allow each New Unicode character to be
                        > stored straightforwardly in a single 32-bit char) the Java gurus
                        > opted to keep "char" at 16-bits
                        > and use UTF-16 to store Unicode strings. If you ask for the "length"
                        > of a Unicode
                        > string in Java, it still returns the length in chars rather than the
                        > length in Unicode
                        > Characters. This is (arguably) quite a mess, and you have to be very
                        > aware of it
                        > as a programmer if you want to handle Supplementary Unicode Characters.

                        Hm. I guess I'll stay with Vim and vim-script, where I know what to expect.
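For readers who want to see the "bit extraction and shifting" that the quoted UTF-8 description refers to, here is a minimal Python sketch. The function name `utf8_encode` is made up for this example, it covers only the modern range up to U+10FFFF, and it does not reject surrogate code points, so treat it as an illustration rather than a complete encoder.

```python
def utf8_encode(cp):
    """Encode one code point (0 .. 0x10FFFF) as UTF-8 bytes by hand."""
    if cp < 0x80:          # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:     # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


# Compare with Python's built-in encoder for the characters used in this thread.
for cp in (0x61, 0xE9, 0x2020, 0x10400):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(hex(cp), utf8_encode(cp).hex())
```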

                        >
                        > The way that Python handles Unicode strings internally depends on how
                        > it is configured/built. If configured for "ucs2", Python stores
                        > Unicode strings as
                        > UTF-16, returns the "length" of strings as the number of 16-bit code
                        > units, and
                        > if you try to loop through the elements of a string, it loops through
                        > 16-bit
                        > values, which creates a mess if your string contains supplementary
                        > characters.
                        > This is comparable to the situation in Java.
                        >
                        > If you configure Python for "ucs4", then each Unicode string is
                        > stored internally as
                        > a string of 32-bit code units, "length" is returned as the number of
                        > Unicode characters, and if you loop through the characters in a
                        > string, you
                        > get one Unicode character (code point value) at a time, even for
                        > supplementary
                        > characters. This "ucs4" option is now formally termed UTF-32 in Unicode
                        > circles.
                        >
                        >
                        >
                        >> Actually, current standards mandate that no codepoints higher than
                        >> U+10FFFD
                        >> will "ever" be used. (Vim supports up to U+3FFFFFFF, with up to 6
                        >> bytes per
                        >> codepoint, following an earlier draft of the standard.)
                        >>
                        >> Unicode also has the notion of "composing characters", which are
                        >> characters
                        >> which are "superimposed" on the preceding character, possibly
                        >> changing its
                        >> shape. These are usually diacritics: most of the accents of Latin
                        >> can be
                        >> either precomposed or spacing-non-accented + composing-accent, but the
                        >> optional vowel marks of Hebrew and Arabic exist only as composing
                        >> characters.
                        >
                        > Quite right. "Character" is a technical term in Unicode, and
                        > includes spaces,
                        > punctuation and these Combining Diacritical Marks (block starting
                        > U+0300)
                        > that might not fall under the everyday notion of character. An

                        also control characters (carriage return, line feed, form feed, horizontal
                        tab, soft hyphen, byte-order mark, zero-width joiner, etc.), which also might
                        not all fall under the everyday notion of "character".

                        > acute-accented é,
                        > for example, can be represented in Unicode either as a single character,
                        >
                        > U+00E9
                        >
                        > which has the name LATIN SMALL LETTER E WITH ACUTE
                        >
                        > You can alternatively represent é as a sequence of two Unicode
                        > characters
                        >
                        > U+0065 LATIN SMALL LETTER E
                        > U+0301 COMBINING ACUTE ACCENT
                        >
                        > The Unicode gods have explicitly decreed that these two
                        > representations are
                        > equivalent, which means that any proper Unicode-capable editor should
                        > handle and display them equivalently.
                        >
                        > In Hopi (spoken in Arizona) orthography (as defined at the University of
                        > Arizona), you have some double-accented graphemes like o with both
                        > diaeresis and an acute, grave or circumflex accent. In Unicode you
                        > can represent o with diaeresis and acute (the acute accent is rendered
                        > above the diaeresis) as either the three-character sequence
                        >
                        > U+006F LATIN SMALL LETTER O
                        > U+0308 COMBINING DIAERESIS
                        > U+0301 COMBINING ACUTE ACCENT
                        >
                        > or as the two-character sequence
                        >
                        > U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
                        > U+0301 COMBINING ACUTE ACCENT
                        >
                        > But there is no single "pre-composed" Unicode character for this
                        > purpose.
                        >
                        > This whole issue of Combining Diacritical Marks is separate from the
                        > issue
                        > of encoding (UTF-8, UTF-16 or UTF-32). Some conversion between
                        > "pre-composed"
                        > and "decomposed" representations can be done using "Normalization"
                        > routines
                        > available in Perl, Python, Java, ICU, etc.

                        but not in Vim. AFAIK, the only "normalization" routines afforded by Vim
                        (other than not using a separate screen cell for a composing character) are: (a)
                        the 'delcombine' option, which, if set, allows <BS> to erase one combining
                        character at a time, while when clear (default) it will erase one spacing
                        character together with any number of combining characters in the same screen
                        cell; and (b) the \Z pattern atom, which will ignore combining characters
                        anywhere in the text while matching. But AFAIK Vim will always treat "é"
                        (U+00E9 LATIN SMALL LETTER E WITH ACUTE) and "é" (U+0065 LATIN SMALL LETTER E
                        + U+0301 COMBINING ACUTE ACCENT) as different even if it displays them the same.
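As an illustration of what a normalization routine does with the examples in this thread, here is a short Python sketch using the standard `unicodedata` module. It shows Python's behaviour, not Vim's; the helper name `same_after_nfc` is made up for this example.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A raw comparison sees two different code point sequences.
print("raw equal:", precomposed == decomposed)
# After normalization both collapse to the same sequence.
print("equal after NFC:", same_after_nfc(precomposed, decomposed))

# The Hopi example: o + COMBINING DIAERESIS + COMBINING ACUTE ACCENT.
hopi = "o\u0308\u0301"
nfc_hopi = unicodedata.normalize("NFC", hopi)
# Only the o + diaeresis part has a precomposed form; the acute stays separate.
print([hex(ord(c)) for c in nfc_hopi])
```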

                        >
                        > These Combining Diacritical Marks need to be rendered above or below,
                        > or attached in particular places, as appropriate, to any letter
                        > character. For
                        > that to work properly, you need a font (e.g. Doulos SIL or Charis
                        > SIL) that
                        > contains the diacritic-positioning information, and you need a
                        > sophisticated rendering
                        > engine (as in XeTeX) that reads and uses that diacritic-positioning
                        > information.
                        >
                        > Most software, including text editors, still does a poor job of handling
                        > Combining Diacritical Marks and supplementary characters in general.

                        In Arabic, Vim handles combining vowels etc. ("harakaat" as Arabic grammarians
                        call them) quite well, including several per character as e.g. in (spacing)
                        seen (Arabic S) + combining shadda (geminated-consonant sign) + combining
                        fatha (Arabic short vowel a), a combination which appears in the fully
                        vocalized form of "as-salaam" (Peace). Starting recently (7.1.116), Vim can
                        now display (not only edit) any codepoint in the current 'guifont', not only
                        those in the BMP. From what you say above, it looks like Vim is ahead of "most
                        software including text editors", but I don't doubt that the situation will
                        get better as time goes on.

                        >
                        >> Since your Deseret characters are outside the BMP, each of them
                        >> requires 4
                        >> bytes in UTF-8 (also two 16-bit words in UTF-16 and one 32-bit
                        >> doubleword in
                        >> UTF-32); but maybe that's not what your measured "length" means?
                        >> Does your
                        >> NSString include a final null (as C strings do) or an initial
                        >> bytecount (as
                        >> Pascal strings do)? Or do your Deseret characters include
                        >> "composing" elements?
                        >
                        > Because the "length" of each Deseret Character is being returned as 2
                        > rather
                        > than 1, it sounds like the MacVim code is using a Java-like UTF-16
                        > internal representation
                        > for storing Unicode characters (including supplementary characters).

                        How do you compute that length? The strlen() function should return 4 for each
                        Deseret character, and the function (similar to that mentioned under ":help
                        strlen()")

                        strlen(substitute(string,'.','-'))

                        should return 1.
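The three different "lengths" being discussed (1, 2 and 4) can be reproduced side by side. This is a Python sketch rather than Vim script or Objective-C; the function name `lengths` is hypothetical, and the UTF-16 count is only assumed to correspond to what an NSString "length" call reports, as suggested earlier in the thread.

```python
def lengths(s):
    """Return the length of s measured in three different units."""
    return {
        "code_points": len(s),                            # Unicode characters
        "utf16_units": len(s.encode("utf-16-le")) // 2,   # 16-bit code units
        "utf8_bytes": len(s.encode("utf-8")),             # bytes, as Vim's strlen() counts
    }


deseret = "\U00010400"   # one letter from the Deseret block
print(lengths(deseret))
```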

                        >
                        > There are no Combining Diacritical Marks required in the traditional
                        > Deseret Alphabet, per se,
                        > although proper rendering software _should_ allow you to associate
                        > one or more Combining
                        > Diacritical Marks with any letter character and have it rendered
                        > acceptably. (Handling
                        > combining diacritical marks with Deseret Alphabet is very low priority.)
                        >
                        > Each Deseret Alphabet letter is a single Unicode character, with a
                        > single code
                        > point value in the supplementary area (block starting U+10400). The
                        > Shavian alphabet is
                        > much the same (in the block starting U+10450). The glyphs are
                        > straightforward, rendered
                        > left-to-right, requiring no ligatures, and could be forced into a
                        > fixed-pitch (mono) font about
                        > as easily as Roman glyphs.
                        >
                        > Ken

                        and a lot more easily than Arabic, where a single letter (with a single code
                        point) may have to be shown in up to 4 different ways (not counting combining
                        characters), depending on its position in the word and on which letter (if
                        any) precedes it. Happily Vim (with +arabic) knows how to fetch the required
                        "presentation forms" from the Arabic fonts. Anyway, the beautiful cursive
                        shapes of Arabic still look ugly when rendered in any monospace font, but
                        that's because Arabic calligraphy, with its long flourishes at the end of
                        almost every word, was invented for the calame (i.e., the reed pen), not the
                        typewriter.


                        Best regards,
                        Tony.
                        --
                        Try to be the best of whatever you are, even if what you are is no
                        good.

