60741 Re: Spelling support doesn´t deal with ` ´´ correctly

  • Tony Mechelynck
    Mar 15, 2011
      On 15/03/11 15:49, Gary Johnson wrote:
      > On 2011-03-11, Nikolai Weibull wrote:
      >
      >> But this is a big "whatever". As latin1 (or, more appropriately,
      >> iso-8859-1) is a superset of ASCII and Unicode is a superset of
      >> latin1, then what I really care about is having support for Unicode
      >> quotes.
      >
      > Latin1 is a superset of ASCII, but Unicode is not a superset of
      > latin1. Unicode supports a larger set of characters than latin1 and
      > shares some character encodings in common with latin1 but it is a
      > different encoding.
      >
      > Regards,
      > Gary
      >

      Unicode is a superset of Latin1 in the sense that every Latin1 character
      is also a Unicode codepoint, and at the same ordinal position (the first
      256 Unicode codepoints are the 256 Latin1 characters in the same order).
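The "same ordinal position" claim can be checked mechanically. The following is only a small sketch using Python's built-in `latin-1` codec; the function name `latin1_matches_unicode` is made up for this illustration.

```python
# Check that each Latin1 byte value decodes to the Unicode codepoint
# with the same ordinal, for all 256 values 0x00 through 0xFF.
def latin1_matches_unicode():
    return all(ord(bytes([i]).decode("latin-1")) == i for i in range(256))

print(latin1_matches_unicode())              # True
print(hex(ord(b"\xe9".decode("latin-1"))))   # 0xe9, i.e. U+00E9 (e with acute)
```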

      However, no Unicode encoding represents Latin1 characters higher than
      0x7F *on disk* by the same binary value that Latin1 does. (UTF-8 and
      GB18030, but not the other Unicode encodings, represent the 128
      US-ASCII characters the same way as both US-ASCII and Latin1.)
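To make the on-disk difference concrete, here is a hedged sketch that encodes one Latin1 character above 0x7F (U+00E9) and one US-ASCII character with Python's standard codecs. The variable names are chosen for this example only, and Python's codecs stand in for whatever converter an editor actually uses.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, byte 0xE9 in Latin1

latin1_bytes = ch.encode("latin-1")
utf8_bytes = ch.encode("utf-8")
utf16be_bytes = ch.encode("utf-16-be")
gb18030_bytes = ch.encode("gb18030")

# Print the byte sequence each encoding writes for the same codepoint.
for name, data in [
    ("latin-1", latin1_bytes),
    ("utf-8", utf8_bytes),
    ("utf-16-be", utf16be_bytes),
    ("gb18030", gb18030_bytes),
]:
    print(name, data.hex())

# A US-ASCII character, by contrast, is the single byte 0x41 in all of these.
ascii_same = all(
    "A".encode(enc) == b"A" for enc in ("ascii", "latin-1", "utf-8", "gb18030")
)
print(ascii_same)
```

Only the `latin-1` line shows the single byte `e9`; the others use a different sequence for the same codepoint.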

      <encyclopedia>
      The above paragraph implies that Unicode is not *one* encoding, even
      though Vim represents all Unicode codepoints the same way *in memory*.
      Rather, Unicode should be seen as a way of classifying all known writing
      systems as a one-dimensional list going from zero to "something high" by
      integer steps or "codepoints". These codepoints may be coded as bytes in
      different ways:
      * UTF-8, which uses one or more bytes per codepoint, and where the byte
      0x00 can only represent the codepoint U+0000 (the null codepoint) so
      it's useful for a representation using C strings. The first byte used
      for any codepoint tells how many bytes there will be in all, the other
      ones (if any) have values which cannot happen in the first byte, so
      synchronization is easy even if corrupt bytes become embedded in the text.
      * UCS-2, which uses one two-byte word (big-endian or little-endian) per
      codepoint and cannot represent any codepoint higher than U+FFFF.
      * UTF-16, which extends UCS-2 up to U+10FFFF by means of "surrogate
      codepoints", using two words (a surrogate pair) for each codepoint
      higher than U+FFFF.
      * UCS-4 aka UTF-32, which can be big-endian or little-endian (or even,
      I've been told, ordered 2143 or 3412) and uses one four-byte doubleword
      per codepoint. It simply stores each codepoint as its ordinal value
      expressed as one unsigned 32-bit integer.
      * GB18030, which is skewed in favour of Chinese; it allows
      representation of any Unicode codepoint but the conversion in either
      direction between it and other Unicode encodings requires bulky tables.
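The list above can be illustrated with a codepoint beyond U+FFFF. This is a minimal sketch, not how Vim does it internally: `surrogate_pair` and `utf8_length` are hypothetical helper names, U+1F600 is an arbitrary sample codepoint, and `utf8_length` only reads the length announced by the first byte without validating the sequence.

```python
def surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into its two UTF-16 words."""
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf8_length(first_byte):
    """Length announced by a UTF-8 first byte; 0 if it cannot start a sequence."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    return 0  # 0x80..0xBF are continuation bytes; 0xF8 and up are unused


cp = 0x1F600
text = chr(cp)

print([hex(w) for w in surrogate_pair(cp)])   # ['0xd83d', '0xde00']
print(text.encode("utf-16-be").hex())         # d83dde00
print(text.encode("utf-8").hex())             # f09f9880
print(utf8_length(text.encode("utf-8")[0]))   # 4
print(text.encode("utf-32-be").hex())         # 0001f600
print(text.encode("gb18030").decode("gb18030") == text)  # round trip
```

The UTF-32 output is just the ordinal value written as a 32-bit integer, while UTF-16 needs the two-word pair computed by `surrogate_pair`.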

      Conversion between any of the above except GB18030 is trivial; Vim does
      it with no need for the iconv library. For UCS-2, UTF-16 and UTF-32,
      when the endianness is omitted, big-endian is implied, even on
      little-endian processors such as the Intel ones used in all Windows PCs,
      most Linux ones, and many of those equipped with Mac OS X.
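The effect of byte order is easy to see on a single character. The sketch below uses Python's codecs, not Vim; note that Python's own bare `utf-16` name behaves differently from the Vim convention described above, since it writes a byte-order mark followed by the machine's native order.

```python
ch = "A"  # U+0041

big = ch.encode("utf-16-be")       # high-order byte first
little = ch.encode("utf-16-le")    # low-order byte first
big32 = ch.encode("utf-32-be")
bare = ch.encode("utf-16")         # Python-specific: BOM + native byte order

print(big.hex())     # 0041
print(little.hex())  # 4100
print(big32.hex())   # 00000041
print(bare[:2] in (b"\xff\xfe", b"\xfe\xff"))  # True: starts with a BOM
```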
      </encyclopedia>


      Best regards,
      Tony.
      --
      Champagne don't make me lazy.
      Cocaine don't drive me crazy.
      Ain't nobody's business but my own.
      -- Taj Mahal

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php