Re: is vim putting the right bytes for utf-8 ?

  • Mike Williams
    Message 1 of 45 , Aug 1, 2005
      drchip@... did utter on 29/07/2005 15:55:
      > Quoting Mike Williams <mike.williams@...>:
      >
      >
      >>Just taking your last statement, then VIM cannot be described as Unicode
      >>conformant. VIM supports various aspects of the Unicode standard, but
      >>is not conformant. As such no guarantees are made about the correctness
      >>of processing files. If someone requires conformance then they will
      >>need 3rd party tools to validate VIM's output.
      >>
      >>Consider VIM as a Swiss Army Editor - it can do many things but it will
      >>also let you damage stuff without any warning. Just as long as people
      >>are aware, that's all.
      >
      >
      > Vim lets one edit files. Changing things necessarily means that they can
      > be "damaged". Vim won't "damage things" unexpectedly, given the editing
      > commands provided to it. So, although your statement is technically correct
      > (for example, :%d will "damage" a file without warning), your implied meaning
      > is false.

      OK, "damaged" was possibly too emotive a word. It would be more accurate
      to say that VIM lets you produce semantically invalid Unicode files
      without any warning.
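
The kind of third-party check mentioned earlier in the thread can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `first_invalid_utf8_offset` is hypothetical, it relies on Python's strict UTF-8 decoder, and it takes a UTF-8-encoded surrogate as one example of a byte sequence a validator should reject. It does not describe what Vim itself does.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first ill-formed UTF-8 sequence, or None.

    Hypothetical helper: it simply delegates to Python's strict decoder,
    which rejects overlong forms, stray continuation bytes, and
    UTF-8-encoded surrogates (U+D800..U+DFFF).
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start  # offset where decoding failed
    return None


# Well-formed text passes the check.
ok = "h\u00e9llo".encode("utf-8")

# ED A0 80 is what U+D800 (a lone surrogate) looks like if it is pushed
# through the UTF-8 bit layout; strict decoders treat it as an error.
bad = b"ab\xed\xa0\x80"

print(first_invalid_utf8_offset(ok))
print(first_invalid_utf8_offset(bad))
```

A check like this only covers well-formedness of the byte stream; it says nothing about unassigned codepoints or other higher-level conformance questions.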

      TTFN

      Mike
      --
      He's eight feet tall and plays the flute, he's clearly high-flutin'.
    • Mike Williams
      Message 45 of 45 , Aug 1, 2005
        François Pinard did utter on 01/08/2005 14:15:
        > [Mike Williams]
        >
        >
        >>>>The Unicode codespace is 21-bit only
        >
        >
        >>>Roughly, but not exactly. You have to exclude surrogates from the
        >>>codespace.
        >
        >
        >>Why do you say that? They are defined ranges of values in the 21-bit
        >>codespace, and have well defined semantics in all Unicode encodings.
        >>Every value in the 21-bit codespace has an assignment, even if it is
        >>"unassigned codepoint".
        >
        >
        > Allow me to quote the Recode manual (:-):
        >
        >
        > Universal Transformation Format, 16 bits
        > ========================================
        >
        > Another external surface of `UCS' is also variable length, each
        > character using either two or four bytes. It is usable for the subset
        > defined by the first million characters (17 * 2^16) of `UCS'.
        >
        > Martin J. Dürst writes (to `comp.std.internat', on 1995-03-28):
        >
        > `UTF-16' is another method that reserves two times 1024 codepoints
        > in Unicode and uses them to index around one million additional
        > characters. `UTF-16' is a little bit like former multibyte codes,
        > but quite not so, as both the first and the second 16-bit code
        > clearly show what they are. The idea is that one million
        > codepoints should be enough for all the rare Chinese ideograms and
        > historical scripts that do not fit into the Base Multilingual
        > Plane of ISO 10646 (with just about 63,000 positions available,
        > now that 2,000 are gone).
        >
        > This charset is available in `recode' under the name `UTF-16'.
        > Accepted aliases are `Unicode', `TF-16' and `u6'.
        >

        I think we are talking at cross purposes. Reserving 2x1024 codepoints
        for the surrogate pair values means there are 2x1024 fewer characters
        that can be encoded in the codespace, but it does not reduce the size
        of the codespace. The codespace is 21-bit; the number of distinct
        characters that can be encoded in it is smaller than the number of
        codepoints.
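
The arithmetic behind this distinction can be written out as a short sketch. The constant and function names below are illustrative only; the ranges U+0000..U+10FFFF and U+D800..U+DFFF are the ones the discussion refers to.

```python
from typing import List

# The codespace runs from U+0000 to U+10FFFF: 17 planes of 2**16 codepoints.
MAX_CODEPOINT = 0x10FFFF
CODESPACE_SIZE = MAX_CODEPOINT + 1           # 17 * 2**16
BITS_NEEDED = MAX_CODEPOINT.bit_length()     # width of the largest codepoint

# Two blocks of 1024 codepoints are reserved as surrogates.
HIGH_START, LOW_START, SURROGATE_END = 0xD800, 0xDC00, 0xE000
SURROGATE_COUNT = SURROGATE_END - HIGH_START  # 2 * 1024

# Codepoints left over for characters once surrogates are set aside.
ENCODABLE = CODESPACE_SIZE - SURROGATE_COUNT


def to_utf16_units(cp: int) -> List[int]:
    """Encode one non-surrogate codepoint as one or two 16-bit units."""
    if not 0 <= cp <= MAX_CODEPOINT:
        raise ValueError("outside the codespace")
    if HIGH_START <= cp < SURROGATE_END:
        raise ValueError("surrogate codepoints do not encode characters")
    if cp < 0x10000:
        return [cp]                           # BMP: a single unit
    v = cp - 0x10000                          # 20-bit offset above the BMP
    return [HIGH_START + (v >> 10), LOW_START + (v & 0x3FF)]


print(CODESPACE_SIZE, BITS_NEEDED, SURROGATE_COUNT, ENCODABLE)
print([hex(u) for u in to_utf16_units(0x1D11E)])
```

The surrogates still occupy positions inside the 21-bit range, which is why setting them aside lowers the count of encodable characters without shrinking the codespace itself.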

        TTFN

        Mike
        --
        There has been an alarming increase in the number of things I know
        nothing about.