Re: is vim putting the right bytes for utf-8 ?
- adah@... did utter on 31/07/2005 10:31:
>>Just taking your last statement, then VIM cannot be described as Unicode
>>conformant. VIM supports various aspects of the Unicode standard, but
>>is not conformant. As such no guarantees are made about the correctness
>>of processing files. If someone requires conformance then they will
>>need 3rd party tools to validate VIM's output.
> No one claims Vim is Unicode 4.0 conformant, if that is what you want
> (you'll have to go for something else, presumably having a high price
> label). It is not in many ways. I really do not know of a Unicode text
> editor that can treat combining characters as Unicode 4.0 requires. Vim
> does meet the daily requirements of its users to edit Unicode text.
That is not so. I believe you can claim Unicode conformance and just
support editing of 7-bit ASCII. Not many people would use it, but it is
allowable. A key aspect of claiming conformance is stating what
characters you support, which usually is in terms of complete national
character sets. However, there are additional processing requirements
to achieve conformance.
>>Consider VIM as a Swiss Army Editor - it can do many things but it will
>>also let you damage stuff without any warning. Just as long as people
>>are aware, that's all.
> This is unfair. Vim will not damage anything, if the input is valid.
I did not say VIM would, but that the user is able to unintentionally
generate invalid content. There is some similarity in that VIM is not
an XML editor (will not do XSD validation, etc.); however, users may be
expecting character editing to be correct. After all, Unicode is just
characters.
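For anyone who wants to check the bytes rather than take them on trust, here is a minimal sketch of the kind of third-party validation mentioned above. It relies only on Python's strict UTF-8 decoder; the function name is my own invention and has nothing to do with Vim itself.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration: it leans on Python's strict
    UTF-8 decoder, which rejects stray bytes, truncated sequences and
    encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


if __name__ == "__main__":
    import sys

    # Usage: python check_utf8.py some_file.txt
    with open(sys.argv[1], "rb") as handle:
        offset = first_invalid_utf8_offset(handle.read())
    if offset is None:
        print("valid UTF-8")
    else:
        print(f"invalid UTF-8 at byte offset {offset}")
```

This only tells you the byte stream is well-formed UTF-8. It says nothing about the other processing requirements of conformance, such as the handling of combining characters.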
I'm out of my mind, but feel free to leave a message.
- François Pinard did utter on 01/08/2005 14:15:
> [Mike Williams]
>>>>The Unicode codespace is 21-bit only
>>>Roughly, but not exactly. You have to exclude surrogates from the
>>Why do you say that? They are defined ranges of values in the 21-bit
>>codespace, and have well defined semantics in all Unicode encodings.
>>Every value in the 21-bit codespace has an assignment, even if it is
> Allow me to quote the Recode manual (:-):
> Universal Transformation Format, 16 bits
> Another external surface of `UCS' is also variable length, each
> character using either two or four bytes. It is usable for the subset
> defined by the first million characters (17 * 2^16) of `UCS'.
> Martin J. Du"rst writes (to `comp.std.internat', on 1995-03-28):
> `UTF-16' is another method that reserves two times 1024 codepoints
> in Unicode and uses them to index around one million additional
> characters. `UTF-16' is a little bit like former multibyte codes,
> but quite not so, as both the first and the second 16-bit code
> clearly show what they are. The idea is that one million
> codepoints should be enough for all the rare Chinese ideograms and
> historical scripts that do not fit into the Base Multilingual
> Plane of ISO 10646 (with just about 63,000 positions available,
> now that 2,000 are gone).
> This charset is available in `recode' under the name `UTF-16'.
> Accepted aliases are `Unicode', `TF-16' and `u6'.
I think we are talking at cross purposes. The use of 2x1024 codepoints
for the surrogate pair values means there are 2x1024 fewer characters
that could be encoded using the codespace, but this does not reduce the
size of the codespace. The codespace is 21 bit; the number of distinct
characters encoded in the codespace is less than this.
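To put numbers on that distinction, here is a rough Python sketch of the arithmetic, together with the standard UTF-16 surrogate pair calculation. The constant and function names are mine, chosen for illustration only.

```python
# Size of the codespace versus the number of encodable characters.
CODESPACE_SIZE = 17 * 2**16            # code points U+0000..U+10FFFF
SURROGATES = 2 * 1024                  # U+D800..U+DFFF, reserved for UTF-16
SCALAR_VALUES = CODESPACE_SIZE - SURROGATES


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


if __name__ == "__main__":
    print(CODESPACE_SIZE, SCALAR_VALUES, (0x10FFFF).bit_length())
    print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])
```

The 1024 x 1024 combinations of high and low surrogates index exactly the 16 planes above the first one, which is where "around one million additional characters" comes from. The surrogate values themselves still occupy positions in the codespace; they are just not available for characters.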
There has been an alarming increase in the number of things I know