Re: is vim putting the right bytes for utf-8 ?
- drchip@... did utter on 29/07/2005 15:55:
> Quoting Mike Williams <mike.williams@...>:
>>Just taking your last statement, then VIM cannot be described as Unicode
>>conformant. VIM supports various aspects of the Unicode standard, but
>>is not conformant. As such no guarantees are made about the correctness
>>of processing files. If someone requires conformance then they will
>>need 3rd party tools to validate VIM's output.
>>Consider VIM as a Swiss Army Editor - it can do many things but it will
>>also let you damage stuff without any warning. Just as long as people
>>are aware, that's all.
> Vim lets one edit files. Changing things necessarily means that they can
> be "damaged". Vim won't "damage things" unexpectedly, given the editing
> commands provided to it. So, although your statement is technically correct
> (for example, :%d will "damage" a file without warning), your implied meaning
> is false.
Ok, damaged was possibly too emotive a word. It would be more accurate
to say VIM lets you produce semantically invalid Unicode files without
He's eight feet tall and plays the flute, he's clearly high-flutin'.
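As an illustration of the third-party validation mentioned above, here is a minimal sketch in Python (not a tool anyone in this thread used). It only checks that a byte string is well-formed UTF-8, which is a much narrower test than full Unicode conformance. The function name find_invalid_utf8 is hypothetical.

```python
from typing import Optional


def find_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the offset of the first ill-formed UTF-8 sequence, or None.

    Relies on Python's strict UTF-8 decoder, which rejects overlong forms,
    encoded surrogates (U+D800..U+DFFF) and values above U+10FFFF.
    This checks well-formedness only, not wider Unicode conformance.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start  # byte offset where decoding failed
    return None


if __name__ == "__main__":
    samples = {
        "valid text": "caf\u00e9".encode("utf-8"),
        "encoded surrogate": b"ok\xed\xa0\x80",
        "overlong NUL": b"\xc0\x80",
    }
    for label, raw in samples.items():
        print(label, find_invalid_utf8(raw))
```

To check a file written by an editor, one could read it in binary mode and pass the bytes to the function; a result of None means only that the byte sequence is well-formed.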
- François Pinard did utter on 01/08/2005 14:15:
> [Mike Williams]
>>>>The Unicode codespace is 21-bit only
>>>Roughly, but not exactly. You have to exclude surrogates from the
>>Why do you say that? They are defined ranges of values in the 21-bit
>>codespace, and have well defined semantics in all Unicode encodings.
>>Every value in the 21-bit codespace has an assignment, even if it is
> Allow me to quote the Recode manual (:-):
> Universal Transformation Format, 16 bits
> Another external surface of `UCS' is also variable length, each
> character using either two or four bytes. It is usable for the subset
> defined by the first million characters (17 * 2^16) of `UCS'.
> Martin J. Du"rst writes (to `comp.std.internat', on 1995-03-28):
> `UTF-16' is another method that reserves two times 1024 codepoints
> in Unicode and uses them to index around one million additional
> characters. `UTF-16' is a little bit like former multibyte codes,
> but quite not so, as both the first and the second 16-bit code
> clearly show what they are. The idea is that one million
> codepoints should be enough for all the rare Chinese ideograms and
> historical scripts that do not fit into the Base Multilingual
> Plane of ISO 10646 (with just about 63,000 positions available,
> now that 2,000 are gone).
> This charset is available in `recode' under the name `UTF-16'.
> Accepted aliases are `Unicode', `TF-16' and `u6'.
I think we are talking at cross purposes. The use of 2x1024 codepoints
for the surrogate pair values means there are 2x1024 fewer characters
that could be encoded using the codespace, but this does not reduce the
size of the codespace. The codespace is 21-bit; the number of distinct
characters encoded in the codespace is less than this.
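For anyone wanting the numbers behind the two positions, here is a small Python sketch of the arithmetic. It assumes the standard range boundaries (U+0000..U+10FFFF for the codespace, U+D800..U+DBFF and U+DC00..U+DFFF for the surrogates). The names codespace_counts and to_surrogate_pair are made up for this illustration.

```python
# Range boundaries as defined by the Unicode standard.
CODESPACE_SIZE = 0x10FFFF + 1            # 17 planes of 2^16 code points
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 lead values
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 trail values


def codespace_counts() -> dict:
    """Count code points in the codespace, with and without surrogates."""
    surrogates = len(HIGH_SURROGATES) + len(LOW_SURROGATES)
    return {
        "codespace": CODESPACE_SIZE,
        "bits_needed": (CODESPACE_SIZE - 1).bit_length(),
        "surrogates": surrogates,
        "encodable_characters": CODESPACE_SIZE - surrogates,
        # Code points reachable only through a surrogate pair in UTF-16.
        "supplementary": len(HIGH_SURROGATES) * len(LOW_SURROGATES),
    }


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


if __name__ == "__main__":
    for name, value in codespace_counts().items():
        print(name, value)
    print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])
```

The codespace stays at 17 * 2^16 values and needs 21 bits, while the count of values that can stand for characters is smaller by the 2x1024 surrogates. The 1024 * 1024 pair combinations are the "around one million additional characters" from the quoted manual.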
There has been an alarming increase in the number of things I know