Re: multibyte and 'encoding' and 'fileencoding'

  • Antoine J. Mechelynck
    Message 1 of 12 , Jan 1, 2003
      Benji Fisher <benji@...> wrote:
      [...]
      > I am surprised that a file is not supposed to contain a raw
      > "\xe4". What is to stop me from doing
      >
      > :put ="\xe4"
      >
      > --Benji Fisher

      If your file is in UTF-8, then obviously it must obey UTF-8 encoding rules;
      and these rules say (among other things) that:

      - Codepoints from 0000 to 007F are compatible with us-ascii and are
      encoded as one byte, with high bit off
      - Codepoints from 0080 upwards are encoded as a string of 2 or more
      bytes; the first of those is 0xC0 or above, the other(s) lie in the
      range 0x80-0xBF. The number of leading 1 bits in the first byte gives
      the total number of bytes in the sequence

      So there is a strict separation between single-bytes (0x00-0x7F),
      first-bytes (0xC0-0xFF, and not all values in that range are legal) and
      following-bytes (0x80-0xBF) to avoid context ambiguity.
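
The byte ranges above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not anything Vim itself runs; the names `classify_byte` and `encode_codepoint` are made up for the example, and the encoder stops at U+FFFF and does not reject surrogates.

```python
def classify_byte(b: int) -> str:
    """Say which role a byte value can play in a UTF-8 stream."""
    if b <= 0x7F:
        return "single"      # 0x00-0x7F: us-ascii, high bit off
    if b <= 0xBF:
        return "following"   # 0x80-0xBF: second or later byte of a sequence
    return "first"           # 0xC0-0xFF: first byte (not all values are legal)


def encode_codepoint(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder for codepoints up to U+FFFF (sketch only)."""
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("codepoints above U+FFFF are left out of this sketch")


encoded = encode_codepoint(0x00E4)
roles = [classify_byte(b) for b in encoded]
```

For codepoint 00E4 this gives one first-byte followed by one following-byte, so a reader of the byte stream can always tell where a character starts.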

      Details can be found somewhere on the Unicode site, whose entry page is at
      http://www.unicode.org/ . And don't forget that if 'encoding' is set to
      utf-8, then all files will be internally represented as UTF-8 while editing,
      with translation when reading or writing non-UTF-8 files. So typing (in
      Insert mode) Ctrl-V followed by xE4 will enter the 00E4 codepoint into
      memory as two bytes, 0xC3 0xA4, but show it as one character, small a with
      umlaut; and pressing x once in Normal mode with the cursor on that
      character deletes both bytes.
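
To make the difference between a raw 0xE4 byte and the 00E4 codepoint concrete, here is a small Python sketch. It uses Python's own codecs as a stand-in for Vim's conversion, and it assumes Latin-1 as the example of a non-UTF-8 file encoding; neither detail comes from Vim's source.

```python
raw = b"\xe4"   # the single byte a raw "\xe4" would put into the file

# A lone 0xE4 is a first-byte with no following-bytes, so it is not valid UTF-8.
try:
    raw.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False

# The 00E4 codepoint is two bytes in UTF-8 but one byte in Latin-1.
as_utf8 = "\u00e4".encode("utf-8")
as_latin1 = "\u00e4".encode("latin-1")

# Translation on reading: Latin-1 bytes on disk become UTF-8 bytes in memory.
in_memory = raw.decode("latin-1").encode("utf-8")

# Two bytes in memory, but still a single character to the user.
char_count = len(in_memory.decode("utf-8"))
```

The same byte 0xE4 is therefore a complete character in a Latin-1 file and an incomplete sequence in a UTF-8 file, which is why the translation step matters.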

      HTH,
      Tony.
    • Benji Fisher
      Message 2 of 12 , Jan 1, 2003
        Antoine J. Mechelynck wrote:
        > Benji Fisher <benji@...> wrote:
        > [...]
        >
        >> I am surprised that a file is not supposed to contain a raw
        >>"\xe4". What is to stop me from doing
        >>
        >>
        >> :put ="\xe4"
        >>
        >>--Benji Fisher
        >
        >
        > If your file is in UTF-8, then obviously it must obey UTF-8 encoding rules;
        > and these rules say (among other things) that:
        >
        > - Codepoints from 0000 to 007F are compatible with us-ascii and are
        > encoded as one byte, with high bit off
        > - Codepoints from 0080 upwards are encoded as a string of 2 or more
        > bytes; the first of those is 0xC0 or above, the other(s) lie in the
        > range 0x80-0xBF. The number of leading 1 bits in the first byte gives
        > the total number of bytes in the sequence
        >
        > So there is a strict separation between single-bytes (0x00-0x7F),
        > first-bytes (0xC0-0xFF, and not all values in that range are legal) and
        > following-bytes (0x80-0xBF) to avoid context ambiguity.
        >
        > Details can be found somewhere on the Unicode site, whose entry page is at
        > http://www.unicode.org/ . And don't forget that if 'encoding' is set to
        > utf-8, then all files will be internally represented as UTF-8 while editing,
        > with translation when reading or writing non-UTF-8 files. So typing (in
        > Insert mode) Ctrl-V followed by xE4 will enter the 00E4 codepoint into
        > memory as two bytes, 0xC3 0xA4, but show it as one character, small a with
        > umlaut; and pressing x once in Normal mode with the cursor on that
        > character deletes both bytes.

        Thanks for the details. That's already more than I think I need
        to know (for now) so I am not going to follow the link.

        Perhaps my :put command should also insert the 00E4 codepoint, the
        same as <C-V>xE4 in Insert mode.
        <later>
        On another thread (multibyte in patterns) Bram suggests a new "\uab"
        instead of "\xab". Maybe that is the way to go...

        --Benji Fisher
      • Antoine J. Mechelynck
        Message 3 of 12 , Jan 1, 2003
          Benji Fisher <benji@...> wrote:
          [...]
          > Perhaps my :put command should also insert the 00E4 codepoint, the
          > same as <C-V>xE4 in Insert mode.

          That would, if done correctly, avoid putting invalid byte-sequences into
          UTF-8 files.

          > <later>
          > On another thread (multibyte in patterns) Bram suggests a new "\uab"
          > instead of "\xab". Maybe that is the way to go...

          I saw that message from Bram, and noticed a patch that went with it. I think
          it's a good idea; but since I lack a vim-compile facility, I shall wait
          until it is incorporated into a (supposedly stable) binary distribution. (At
          the moment I am using gvim 6.1.243 +win32 +ole.)

          >
          > --Benji Fisher

          Tony.