Re: Spelling support doesn't deal with ` '' correctly
Mar 15, 2011

On 15/03/11 15:49, Gary Johnson wrote:
> On 2011-03-11, Nikolai Weibull wrote:
>> But this is a big "whatever". As latin1 (or, more appropriately,
>> iso-8859-1) is a superset of ASCII and Unicode is a superset of
>> latin1, then what I really care about is having support for Unicode
> Latin1 is a superset of ASCII, but Unicode is not a superset of
> latin1. Unicode supports a larger set of characters than latin1 and
> shares some character encodings in common with latin1 but it is a
> different encoding.

Unicode is a superset of Latin1 in the sense that every Latin1 character
is also a Unicode codepoint, and at the same ordinal position (the first
256 Unicode codepoints are the 256 Latin1 characters in the same order).
However, no Unicode encoding represents Latin1 characters higher than
0x7F *on disk* by the same binary value that Latin1 does. (UTF-8, but not
the other Unicode encodings except maybe --I'm not sure-- GB18030,
represents the 128 US-ASCII characters the same way as both US-ASCII and
Latin1 do.)
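The two halves of that claim can be checked directly; here is a minimal sketch in Python 3, using only the standard codecs (the character é, Latin1 0xE9, is just an illustrative choice):

```python
ch = "\u00e9"  # é: Latin1 byte 0xE9, Unicode codepoint U+00E9

# Same ordinal position: decoding the Latin1 byte 0xE9 yields U+00E9.
assert bytes([0xE9]).decode("latin-1") == ch
assert ord(ch) == 0xE9

# Different on-disk bytes: UTF-8 needs two bytes for anything above 0x7F.
assert ch.encode("latin-1") == b"\xe9"
assert ch.encode("utf-8") == b"\xc3\xa9"

# Below 0x80, US-ASCII, Latin1 and UTF-8 agree byte for byte.
assert "A".encode("latin-1") == "A".encode("utf-8") == b"A"
```

So the codepoint is shared, but the serialized bytes are not once you pass 0x7F.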
The above paragraph implies that Unicode is not *one* encoding, even
though Vim represents all Unicode codepoints the same way *in memory*.
Rather, Unicode should be seen as a way of classifying all known writing
systems into a one-dimensional list, running from zero to "something high"
in integer steps, or "codepoints". These codepoints may be encoded as
bytes in several ways:
* UTF-8, which uses one or more bytes per codepoint, and in which the byte
0x00 can only represent the codepoint U+0000 (the null codepoint), so
it's convenient for representation as C strings. The first byte of
any codepoint's sequence tells how many bytes there will be in all; the
continuation bytes (if any) have values which cannot occur in a first
byte, so resynchronization is easy even if corrupt bytes become embedded
in the text.
* UCS-2, which uses one two-byte word (big-endian or little-endian) per
codepoint and cannot represent any codepoint higher than U+FFFF
* UTF-16, which extends UCS-2 up to U+10FFFF by means of "surrogate
pairs", using two words for codepoints higher than U+FFFF
* UCS-4 aka UTF-32, which can be big-endian or little-endian (or even,
I've been told, ordered 2143 or 3412) and uses one four-byte doubleword
per codepoint. It simply stores each codepoint as its ordinal value
expressed as one unsigned 32-bit integer.
* GB18030, which is skewed in favour of Chinese; it allows
representation of any Unicode codepoint but the conversion in either
direction between it and other Unicode encodings requires bulky tables.
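The properties claimed for each encoding above can be demonstrated in a few lines of Python 3, again using only the standard codecs (the emoji codepoint U+1F600 is just a convenient example above U+FFFF):

```python
s = "\U0001F600"  # a codepoint above U+FFFF

# UTF-8: the first byte announces the total length (0xF0 => 4 bytes).
utf8 = s.encode("utf-8")
assert utf8 == b"\xf0\x9f\x98\x80"
assert len(utf8) == 4

# UTF-16: codepoints above U+FFFF take a surrogate pair, i.e. two
# 16-bit words drawn from the reserved ranges D800-DBFF and DC00-DFFF.
utf16 = s.encode("utf-16-be")
assert len(utf16) == 4
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF

# UTF-32: one unsigned 32-bit integer per codepoint, here big-endian.
assert s.encode("utf-32-be") == (0x1F600).to_bytes(4, "big")

# GB18030 also covers all of Unicode; the byte values differ from the
# UTF encodings, but a round trip is lossless.
assert s.encode("gb18030").decode("gb18030") == s
```

UCS-2 is the one family member missing here: by design it simply cannot represent this codepoint at all.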
Conversion between any of the above except GB18030 is trivial; Vim does
it without needing the iconv library. For UCS-2, UTF-16 and UTF-32,
when the endianness is omitted, big-endian is implied, even on
little-endian processors such as the Intel ones used in all Windows PCs,
most Linux machines, and many of those running Mac OS X.
Champagne don't make me lazy.
Cocaine don't drive me crazy.
Ain't nobody's business but my own.
-- Taj Mahal
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php