Re: is vim putting the right bytes for utf-8 ?

  • Mike Williams
    Message 1 of 45 , Aug 1, 2005
      adah@... did utter on 31/07/2005 10:31:
      >>Just taking your last statement, then VIM cannot be described as Unicode
      >
      >
      >>conformant. VIM supports various aspects of the Unicode standard, but
      >>is not conformant. As such no guarantees are made about the correctness
      >>of processing files. If someone requires conformance then they will
      >>need 3rd party tools to validate VIM's output.
      >
      >
      > No one claims Vim is Unicode 4.0 conformant, if that is what you want
      > (you'll have to go for something else, presumably having a high price
      > label). It is not in many ways. I really do not know of a Unicode text
      > editor that can treat combining characters as Unicode 4.0 requires. Vim
      > does meet the daily requirements of its users to edit Unicode text.

      That is not so; I believe you can claim Unicode conformance and just
      support editing of 7-bit ASCII. Not many people would use it, but it is
      allowable. A key aspect of claiming conformance is stating what
      characters you support, which usually is in terms of complete national
      character sets. However, there are additional processing requirements
      to achieve conformance.

      >>Consider VIM as a Swiss Army Editor - it can do many things but it will
      >>also let you damage stuff without any warning. Just as long as people
      >>are aware, that's all.
      >
      >
      > This is unfair. Vim will not damage anything, if the input is valid.

      I did not say VIM would, but that the user is able to unintentionally
      generate invalid content. There is some similarity in that VIM is not
      an XML editor (it will not do XSD validation, etc.); however, users may
      be expecting character editing to be correct. After all, Unicode is
      just characters, isn't it?
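
As a rough illustration of the kind of external check meant above, here is a
minimal Python sketch that tests whether a byte sequence is well-formed UTF-8.
The function name `is_valid_utf8` is made up for this example, and it relies
on Python's strict UTF-8 decoder rather than anything in VIM itself.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        # Python's decoder is strict by default: it rejects overlong forms,
        # encoded surrogates, and values above U+10FFFF.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("café".encode("utf-8")))  # correctly encoded text
    print(is_valid_utf8(b"\xe9"))                 # a lone Latin-1 byte
    print(is_valid_utf8(b"\xed\xa0\x80"))         # an encoded surrogate
```

Running a saved file's bytes through a check like this would flag content
that an editor allowed through without warning.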

      TTFN

      Mike
      --
      I'm out of my mind, but feel free to leave a message.
    • Mike Williams
      Message 45 of 45 , Aug 1, 2005
        François Pinard did utter on 01/08/2005 14:15:
        > [Mike Williams]
        >
        >
        >>>>The Unicode codespace is 21-bit only
        >
        >
        >>>Roughly, but not exactly. You have to exclude surrogates from the
        >>>codespace.
        >
        >
        >>Why do you say that? They are defined ranges of values in the 21-bit
        >>codespace, and have well defined semantics in all Unicode encodings.
        >>Every value in the 21-bit codespace has an assignment, even if it is
        >>"unassigned codepoint".
        >
        >
        > Allow me to quote the Recode manual (:-):
        >
        >
        > Universal Transformation Format, 16 bits
        > ========================================
        >
        > Another external surface of `UCS' is also variable length, each
        > character using either two or four bytes. It is usable for the subset
        > defined by the first million characters (17 * 2^16) of `UCS'.
        >
        > Martin J. Dürst writes (to `comp.std.internat', on 1995-03-28):
        >
        > `UTF-16' is another method that reserves two times 1024 codepoints
        > in Unicode and uses them to index around one million additional
        > characters. `UTF-16' is a little bit like former multibyte codes,
        > but quite not so, as both the first and the second 16-bit code
        > clearly show what they are. The idea is that one million
        > codepoints should be enough for all the rare Chinese ideograms and
        > historical scripts that do not fit into the Base Multilingual
        > Plane of ISO 10646 (with just about 63,000 positions available,
        > now that 2,000 are gone).
        >
        > This charset is available in `recode' under the name `UTF-16'.
        > Accepted aliases are `Unicode', `TF-16' and `u6'.
        >

        I think we are talking at cross purposes. The use of 2x1024 codepoints
        for the surrogate pair values means there are 2x1024 fewer characters
        that could be encoded using the codespace, but this does not reduce the
        size of the codespace. The codespace is 21-bit; the number of distinct
        characters that can be encoded in the codespace is less than this.
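
The arithmetic can be sketched in a few lines of Python. The ranges are the
standard ones (codespace U+0000..U+10FFFF, surrogates U+D800..U+DFFF); the
names `scalar_value_count` and `to_surrogate_pair` are made up for this
illustration.

```python
CODESPACE_SIZE = 0x10FFFF + 1             # 17 planes * 2**16 code points
HIGH_SURROGATES = range(0xD800, 0xDC00)   # 1024 lead values
LOW_SURROGATES = range(0xDC00, 0xE000)    # 1024 trail values


def scalar_value_count() -> int:
    """Code points left for characters once the surrogates are set aside."""
    return CODESPACE_SIZE - len(HIGH_SURROGATES) - len(LOW_SURROGATES)


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000                 # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    print(CODESPACE_SIZE)                 # size of the codespace is unchanged
    print(scalar_value_count())           # 2 x 1024 fewer than the codespace
    print((0x10FFFF).bit_length())        # bits needed for the largest value
    print([hex(u) for u in to_surrogate_pair(0x1D11E)])
```

The 1024 x 1024 pairs index exactly the code points from U+10000 to U+10FFFF,
which is the "around one million additional characters" in the quoted text.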

        TTFN

        Mike
        --
        There has been an alarming increase in the number of things I know
        nothing about.