Re: is vim putting the right bytes for utf-8 ?

  • François Pinard
    Message 1 of 45 , Aug 1, 2005
      [Mike Williams]

      > >>The Unicode codespace is 21-bit only

      > >Roughly, but not exactly. You have to exclude surrogates from the
      > >codespace.

      > Why do you say that? They are defined ranges of values in the 21-bit
      > codespace, and have well defined semantics in all Unicode encodings.
      > Every value in the 21-bit codespace has an assignment, even if it is
      > "unassigned codepoint".

      Allow me to quote the Recode manual (:-):


      Universal Transformation Format, 16 bits
      ========================================

      Another external surface of `UCS' is also variable length, each
      character using either two or four bytes. It is usable for the subset
      defined by the first million characters (17 * 2^16) of `UCS'.

      Martin J. Dürst writes (to `comp.std.internat', on 1995-03-28):

      `UTF-16' is another method that reserves two times 1024 codepoints
      in Unicode and uses them to index around one million additional
      characters. `UTF-16' is a little bit like former multibyte codes,
      but quite not so, as both the first and the second 16-bit code
      clearly show what they are. The idea is that one million
      codepoints should be enough for all the rare Chinese ideograms and
      historical scripts that do not fit into the Base Multilingual
      Plane of ISO 10646 (with just about 63,000 positions available,
      now that 2,000 are gone).

      This charset is available in `recode' under the name `UTF-16'.
      Accepted aliases are `Unicode', `TF-16' and `u6'.
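
The surrogate mechanism described in the quoted passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_utf16_units` is hypothetical and is not part of `recode` or Vim. It follows the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits across a high and a low surrogate.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for pairing and do not stand for characters.
        raise ValueError("surrogate code point cannot be encoded on its own")
    if cp < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit (two bytes).
        return [cp]
    # Supplementary planes: 20 bits split over two units (four bytes).
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # one of 1024 high surrogates
    low = 0xDC00 + (v & 0x3FF)   # one of 1024 low surrogates
    return [high, low]


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
print([hex(u) for u in to_utf16_units(0x1D11E)])
```

The two blocks of 1024 surrogates give 1024 * 1024 = 1,048,576 pairs, which is the "around one million additional characters" mentioned in the quote.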

      --
      François Pinard http://pinard.progiciels-bpi.ca
    • Mike Williams
      Message 45 of 45 , Aug 1, 2005
        François Pinard did utter on 01/08/2005 14:15:
        > [Mike Williams]
        >
        >
        >>>>The Unicode codespace is 21-bit only
        >
        >
        >>>Roughly, but not exactly. You have to exclude surrogates from the
        >>>codespace.
        >
        >
        >>Why do you say that? They are defined ranges of values in the 21-bit
        >>codespace, and have well defined semantics in all Unicode encodings.
        >>Every value in the 21-bit codespace has an assignment, even if it is
        >>"unassigned codepoint".
        >
        >
        > Allow me to quote the Recode manual (:-):
        >
        >
        > Universal Transformation Format, 16 bits
        > ========================================
        >
        > Another external surface of `UCS' is also variable length, each
        > character using either two or four bytes. It is usable for the subset
        > defined by the first million characters (17 * 2^16) of `UCS'.
        >
        > Martin J. Dürst writes (to `comp.std.internat', on 1995-03-28):
        >
        > `UTF-16' is another method that reserves two times 1024 codepoints
        > in Unicode and uses them to index around one million additional
        > characters. `UTF-16' is a little bit like former multibyte codes,
        > but quite not so, as both the first and the second 16-bit code
        > clearly show what they are. The idea is that one million
        > codepoints should be enough for all the rare Chinese ideograms and
        > historical scripts that do not fit into the Base Multilingual
        > Plane of ISO 10646 (with just about 63,000 positions available,
        > now that 2,000 are gone).
        >
        > This charset is available in `recode' under the name `UTF-16'.
        > Accepted aliases are `Unicode', `TF-16' and `u6'.
        >

        I think we are talking at cross purposes. The use of 2x1024 codepoints
        for the surrogate pair values means there are 2x1024 fewer characters
        that could be encoded using the codespace, but this does not reduce the
        size of the codespace. The codespace is 21-bit; the number of distinct
        characters that can be encoded in it is smaller than that.
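
The distinction between the size of the codespace and the number of encodable characters can be checked with a little arithmetic. The following is a minimal sketch; the constant names and the helper `is_scalar_value` are made up for this illustration. It assumes the usual ranges: a codespace of U+0000..U+10FFFF and surrogates at U+D800..U+DFFF.

```python
CODESPACE_SIZE = 0x10FFFF + 1               # every code point, surrogates included
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1       # the 2 x 1024 reserved values
SCALAR_VALUE_COUNT = CODESPACE_SIZE - SURROGATE_COUNT


def is_scalar_value(cp: int) -> bool:
    """True if cp is in the codespace and is not a surrogate (hypothetical helper)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


print(CODESPACE_SIZE)               # 17 * 2**16 code points
print(SURROGATE_COUNT)              # code points set aside for UTF-16 pairing
print(SCALAR_VALUE_COUNT)           # code points left to stand for characters
print((0x10FFFF).bit_length())      # bits needed for the largest code point
```

The surrogates remain members of the codespace, so its size is unchanged; they are only subtracted when counting the values that can represent characters.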

        TTFN

        Mike
        --
        There has been an alarming increase in the number of things I know
        nothing about.