Re: is vim putting the right bytes for utf-8 ?

  • Mike Williams
    Message 1 of 45 , Aug 1, 2005
      François Pinard did utter on 01/08/2005 14:15:
      > [Mike Williams]
      >
      >
      >>>>The Unicode codespace is 21-bit only
      >
      >
      >>>Roughly, but not exactly. You have to exclude surrogates from the
      >>>codespace.
      >
      >
      >>Why do you say that? They are defined ranges of values in the 21-bit
      >>codespace, and have well defined semantics in all Unicode encodings.
      >>Every value in the 21-bit codespace has an assignment, even if it is
      >>"unassigned codepoint".
      >
      >
      > Allow me to quote the Recode manual (:-):
      >
      >
      > Universal Transformation Format, 16 bits
      > ========================================
      >
      > Another external surface of `UCS' is also variable length, each
      > character using either two or four bytes. It is usable for the subset
      > defined by the first million characters (17 * 2^16) of `UCS'.
      >
      > Martin J. Dürst writes (to `comp.std.internat', on 1995-03-28):
      >
      > `UTF-16' is another method that reserves two times 1024 codepoints
      > in Unicode and uses them to index around one million additional
      > characters. `UTF-16' is a little bit like former multibyte codes,
      > but quite not so, as both the first and the second 16-bit code
      > clearly show what they are. The idea is that one million
      > codepoints should be enough for all the rare Chinese ideograms and
      > historical scripts that do not fit into the Base Multilingual
      > Plane of ISO 10646 (with just about 63,000 positions available,
      > now that 2,000 are gone).
      >
      > This charset is available in `recode' under the name `UTF-16'.
      > Accepted aliases are `Unicode', `TF-16' and `u6'.
      >

      I think we are talking at cross purposes. Using 2x1024 codepoints
      for the surrogate pair values means there are 2x1024 fewer characters
      that can be encoded in the codespace, but it does not reduce the
      size of the codespace. The codespace is 21-bit; the number of distinct
      characters that can be encoded in it is smaller than that.
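
To put numbers on the distinction, here is a minimal Python sketch. The
constant and function names are illustrative assumptions, not taken from
Vim or Recode; the ranges are the standard surrogate ranges U+D800..U+DBFF
and U+DC00..U+DFFF. It counts the codespace, subtracts the 2x1024 surrogate
values, and shows how a pair indexes one supplementary character.

```python
# Codespace size versus the number of encodable characters.
CODESPACE_SIZE = 17 * 2**16               # 17 planes of 65,536 code points
HIGH_SURROGATES = range(0xD800, 0xDC00)   # 1024 lead values
LOW_SURROGATES = range(0xDC00, 0xE000)    # 1024 trail values

surrogate_count = len(HIGH_SURROGATES) + len(LOW_SURROGATES)
encodable = CODESPACE_SIZE - surrogate_count

# The pairs index 1024 * 1024 code points, i.e. the 16 planes above the BMP.
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)


def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = cp - 0x10000                 # 20-bit offset into those planes
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The largest code point needs 21 bits, although 2**21 values are not all used.
print((CODESPACE_SIZE - 1).bit_length())
print(CODESPACE_SIZE, surrogate_count, encodable, supplementary)
print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])
```

The surrogate values stay inside the codespace as code points; they are
just not available for characters, which is the point being made above.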

      TTFN

      Mike
      --
      There has been an alarming increase in the number of things I know
      nothing about.