Re: multibyte and 'encoding' and 'fileencoding'

  • Benji Fisher
    Message 1 of 12, Jan 1, 2003
      Antoine J. Mechelynck wrote:
      > Benji Fisher <benji@...> wrote:
      > [...]
      >
      >> I am surprised that a file is not supposed to contain a raw
      >>"\xe4". What is to stop me from doing
      >>
      >>
      >>    :put ="\xe4"
      >>
      >>--Benji Fisher
      >
      >
      > If your file is in UTF-8, then obviously it must obey UTF-8 encoding rules;
      > and these rules say (among other things) that:
      >
      > - Codepoints from 0000 to 007F are compatible with us-ascii and are
      > encoded as one byte, with high bit off
      > - Codepoints from 0080 upwards are encoded as a string of 2 or more
      > bytes; the first of those is 0xC0 or greater, the other(s) lie in the
      > range 0x80-0xBF. The number of leading high bits in the first byte
      > determines the number of following bytes
      >
      > So there is a strict separation between single-bytes (0x00-0x7F),
      > first-bytes (0xC0-0xFF, and not all values in that range are legal) and
      > following-bytes (0x80-0xBF) to avoid context ambiguity.
      >
      > Details can be found somewhere on the Unicode site, whose entry page is at
      > http://www.unicode.org/ . And don't forget that if 'encoding' is set to
      > utf-8, then all files will be internally represented as UTF-8 while editing,
      > with translation when reading or writing non-UTF-8 files. So typing (in
      > Insert mode) Ctrl-V followed by xE4 will enter the 00E4 codepoint into
      > memory as two bytes, 0xC3 0xA4, but show it as one character, small a with
      > umlaut; and pressing x once in Normal mode with the cursor on that
      > character deletes both bytes.
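
The rules quoted above can be tried out by hand. What follows is only a
minimal sketch in Python, not anything taken from Vim: the function names
utf8_encode and classify are made up for illustration, and the sketch
assumes the modern limit of four bytes per character and does not reject
surrogate codepoints.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one codepoint as UTF-8 by hand (1 to 4 byte forms only)."""
    if codepoint < 0:
        raise ValueError("codepoint must be non-negative")
    if codepoint < 0x80:
        # 0000-007F: one byte, high bit off, same as us-ascii.
        return bytes([codepoint])
    if codepoint < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x110000:
        # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (codepoint >> 18),
                      0x80 | ((codepoint >> 12) & 0x3F),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    raise ValueError("codepoint out of range")


def classify(byte: int) -> str:
    """Sort a byte value into the three ranges described above."""
    if byte <= 0x7F:
        return "single"
    if byte <= 0xBF:
        return "following"
    return "first"  # 0xC0-0xFF; not every value here is legal


if __name__ == "__main__":
    encoded = utf8_encode(0xE4)
    print(encoded.hex(" "))
    print([classify(b) for b in encoded])
```

For codepoint 00E4 this gives the two bytes 0xC3 0xA4 mentioned above, a
first-byte followed by one following-byte.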

      Thanks for the details. That's already more than I think I need
      to know (for now) so I am not going to follow the link.

      Perhaps my :put command should also insert the 00E4 codepoint, the
      same as <C-V>xE4 in Insert mode.
      <later>
      On another thread (multibyte in patterns) Bram suggests a new "\uab"
      instead of "\xab". Maybe that is the way to go...

      --Benji Fisher
    • Antoine J. Mechelynck
      Message 2 of 12, Jan 1, 2003
        Benji Fisher <benji@...> wrote:
        [...]
        > Perhaps my :put command should also insert the 00E4 codepoint, the
        > same as <C-V>xE4 in Insert mode.

        That would, if done correctly, avoid putting invalid byte-sequences into
        UTF-8 files.
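
To see why a raw 0xE4 byte counts as an invalid sequence, here is a small
check outside Vim, again only a sketch in Python. The helper name
is_valid_utf8 is hypothetical; it simply leans on Python's own strict UTF-8
decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # A lone 0xE4 is a first-byte with no following-bytes after it.
    print(is_valid_utf8(b"\xe4"))
    # The proper encoding of codepoint 00E4.
    print(is_valid_utf8(b"\xc3\xa4"))
    # The same raw byte is a perfectly good character in Latin-1.
    print(b"\xe4".decode("latin-1") == "\u00e4")
```

The lone byte fails because 0xE4 falls in the first-byte range and
announces following-bytes that never arrive, whereas the same value is a
complete character in a single-byte encoding such as Latin-1.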

        > <later>
        > On another thread (multibyte in patterns) Bram suggests a new "\uab"
        > instead of "\xab". Maybe that is the way to go...

        I saw that message from Bram, and noticed a patch that went with it. I think
        it's a good idea; but since I lack a vim-compile facility, I shall wait
        until it is incorporated into a (supposedly stable) binary distribution. (At
        the moment I am using gvim 6.1.243 +win32 +ole.)

        >
        > --Benji Fisher

        Tony.