
Re: Multibyte bugs (Update)

  • Tony Mechelynck
    Message 1 of 1 , Apr 7, 2010
      On 03/04/10 06:36, Tony Mechelynck wrote:
      > Hi Bram,
      >
      > 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
      > GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
      > doesn't give the expected result: instead of U+00A0 ("Alt-space", the
      > non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
      > correctly.
      >
      > 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?
      >
      > 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints. I
      > tried to find examples and counterexamples, as follows (in the comment
      > after the :echo statements, the UTF-8 expansion in hex):
      >
      > :echo "«\<Char-0x40>»" | " 40
      > «@»
      > :echo "«\<Char-0x80>»" | " C2 80
      > «<80><fe>X»
      > :echo "«\<Char-0x100>»" | " C4 80
      > «Ā<fe>X»
      > :echo "«\<Char-0x101>»" | " C4 81
      > «ā»
      > :echo "«\<Char-0x180>»" | " C6 80
      > «ƀ<fe>X»
      > :echo "«\<Char-0x190>»" | " C6 90
      > «Ɛ»
      > :echo "«\<Char-0x1A0>»" | " C6 A0
      > «Ơ»
      > :echo "«\<Char-0x1C0>»" | " C7 80
      > «ǀ<fe>X»
      > :echo "«\<Char-0x4E00>»" | " E4 B8 80
      > «一<fe>X»
      > :echo "«\<Char-0x4E01>»" | " E4 B8 81
      > «丁»
      > :echo "«\<Char-0x4E20>»" | " E4 B8 A0
      > «丠»
      > :echo "«\<Char-0x4E40>»" | " E4 B9 80
      > «乀<fe>X»
      > :echo "«\<Char-0xE000>»" | " EE 80 80
      > «<ee><80><fe>X<80><fe>X»
      > :echo "«\<Char-57344>»" | " EE 80 80
      > «<ee><80><fe>X<80><fe>X»
      > :echo "«\<Char-0xE001>»" | " EE 80 81
      > «<ee><80><fe>X<81>»
      > :echo "«\<Char-0xE040>»" | " EE 81 80
      > «<fe>X»
      >
      > This seems to indicate that the extra bytes 0xFE 0x58 appear after any
      > 0x80 in the UTF-8 expansion of the character. (I added the « »
      > characters to "bound" the display so that any extra whitespace would be
      > visible, but they have no effect on the bug.)
      >
      > The bug does not occur after Ctrl-V u in Insert mode or when using
      > <Char-...> in an Insert-mode mapping. It does occur when using
      > "\<Char-...>" in commands other than :echo. Note the following:
      >
      > :let j = "\<Char-0xE000>"
      > :let j
      > j <ee><80><fe>X<80><fe>X
      > i<Ctrl-R>=j<Enter>
      > î<t_þ>X<t_þ>X
      >
      > (where <Ctrl-R> and <Enter> are one keystroke each, not counting
      > modifiers). Apparently gvim tries to interpret 0x80 0xFE as a "special
      > key", and "resolves" it (incorrectly) as <t_þ>.
      >
      > Two very big files were loaded when I first noticed bug #3, but
      > restarting gvim without them reproduced the bug again with the same
      > spurious bytes.
      >
      >
      > Best regards,
      > Tony.
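
The hex expansions given in the comments of the quoted :echo table can be
double-checked independently of Vim. The following is a minimal sketch in
Python (chosen only for illustration; the helper names utf8_hex and has_0x80
are hypothetical and not part of Vim). It prints the UTF-8 bytes of a few of
the code points above and whether they contain a 0x80 byte, which is the
pattern the table points to.

```python
def utf8_hex(codepoint):
    """Return the UTF-8 bytes of a code point as space-separated hex."""
    return " ".join(f"{b:02X}" for b in chr(codepoint).encode("utf-8"))


def has_0x80(codepoint):
    """True if the UTF-8 encoding of the code point contains a 0x80 byte."""
    return 0x80 in chr(codepoint).encode("utf-8")


if __name__ == "__main__":
    # A subset of the code points tested in the report.
    for cp in (0x40, 0x80, 0x100, 0x101, 0x4E00, 0x4E01, 0xE000, 0xE040):
        print(f"U+{cp:04X}: {utf8_hex(cp)}  contains 0x80: {has_0x80(cp)}")
```

In the quoted table, every entry whose expansion contains 0x80 shows the
extra <fe>X in its output, and the entries without 0x80 display correctly.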

      Update: There is a second case which triggers incorrect behaviour in
      "\<Char-nnnn>" when 'encoding' is UTF-8:

      - As noted above, after every 0x80 byte in the UTF-8 representation, the
      bytes 0xFE 0x58 are spuriously added. If the 0x80 is the last byte, they
      follow the UTF-8 string, giving two invalid bytes after the correct
      multibyte glyph. If there is a 0x80 byte other than the last, they land
      in the middle of the sequence and make the whole multibyte sequence
      invalid. (The 0x80 can never be the first byte, because the first byte
      of a multibyte UTF-8 sequence is >= 0xC0 [0xC2 actually, except for
      "overlong" sequences representing ASCII bytes], and it cannot be an
      "only byte" because single-byte sequences are <= 0x7F.)

      - In addition, after every 0x9B byte, the bytes 0xFD 0x4F are added,
      also immediately after that byte, breaking the UTF-8 sequence if it
      isn't the last byte.

      - The above are repeatable "every time", even from one run of gvim to
      the next, and I always get 0x80 0xFE 0x58 instead of 0x80, and 0x9B 0xFD
      0x4F instead of 0x9B, in all the UTF-8 sequences generated by the
      "\<Char-nnnn>" construct.

      - Removing the spurious bytes (including those in the middle of a byte
      sequence) makes the correct multibyte glyph appear immediately (I'm
      assuming, of course, that 'encoding' is still set to UTF-8).

      - The fact that those two byte values, 0x80 aka Alt-Null and 0x9B aka
      Alt-Escape aka CSI, play special roles in gvim's representation of
      special keys might help to spot where the bug comes from. (Did I
      mention it? I tested all this in GUI mode, in my usual "Huge" gvim with
      GTK2/Gnome GUI, and, of course, with +multi_byte among others. Currently
      at patchlevel 7.2.411.)
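
The behaviour listed above can be summarised as a small model. This is only
a sketch of the observed symptoms, written in Python for illustration; it is
not taken from Vim's source, and the names SPURIOUS, corrupt and repair are
hypothetical. corrupt() reproduces the bytes reported for "\<Char-nnnn>",
and repair() strips the extra pairs again. Stripping them should be safe
under the assumption that the data is otherwise valid UTF-8, since the bytes
0xFD and 0xFE never occur in UTF-8 as defined by RFC 3629.

```python
# Byte pairs observed in the report after each 0x80 and 0x9B byte.
# This table models the symptoms only; it is not Vim's internal logic.
SPURIOUS = {0x80: b"\xfe\x58", 0x9B: b"\xfd\x4f"}


def corrupt(codepoint):
    """Model the reported byte output of the <Char-nnnn> string escape."""
    out = bytearray()
    for b in chr(codepoint).encode("utf-8"):
        out.append(b)
        out += SPURIOUS.get(b, b"")  # append the extra pair, if any
    return bytes(out)


def repair(data):
    """Drop a spurious pair wherever it directly follows 0x80 or 0x9B."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        out.append(b)
        i += 1
        extra = SPURIOUS.get(b)
        if extra is not None and data[i:i + 2] == extra:
            i += 2  # skip the two spurious bytes
    return bytes(out)


if __name__ == "__main__":
    broken = corrupt(0xE000)
    print(" ".join(f"{b:02x}" for b in broken))
    print(repair(broken) == chr(0xE000).encode("utf-8"))
```

For U+E000 the model yields EE 80 FE 58 80 FE 58, which matches the
<ee><80><fe>X<80><fe>X shown by :echo, and repair() restores EE 80 80.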


      I'm crossposting this update to vim_dev because my first post (in
      vim_multibyte) got no reply whatsoever; but it was only four days ago,
      and the Easter holiday is upon us; maybe I wasn't patient enough.


      Have a nice holiday, and Happy Vimming!
      Tony.
      --
      Immortality -- a fate worse than death.
      -- Edgar A. Shoaff

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php

      To unsubscribe, reply using "remove me" as the subject.