52203 Re: Gvim for Windows doesn't handle non-BMP characters when interchanging data with Windows OS

  • Tony Mechelynck
    Oct 22, 2008
      On 22/10/08 15:55, JiaYanwei wrote:
      > When interchanging data with Windows, such as in clipboard operations,
      > gvim converts the text to UCS-2. Unlike UTF-16, however, UCS-2 cannot
      > encode non-BMP characters.
      >
      > For example, pasting the non-BMP character U+248BB from the Windows
      > clipboard inserts two separate characters, <d852> <dcbb>. This is caused
      > by the function ucs2_to_utf8() in src/os_mswin.c, which treats the two
      > halves of a surrogate pair as separate Unicode characters and converts
      > them into the invalid UTF-8 sequence 0xED 0xA1 0x92 0xED 0xB2 0xBB. The
      > correct UTF-8 sequence is 0xF0 0xA4 0xA2 0xBB.
      >
      > Similarly, copying the non-BMP character U+248BB to the Windows clipboard
      > leaves U+48BB on the clipboard, because the function utf8_to_ucs2() in
      > src/os_mswin.c casts the integer 0x248BB to a short integer, 0x48BB.
      >
      > The attachment is a patch that adds surrogate pair handling to the two
      > functions mentioned above. With it, non-BMP characters are interchanged
      > correctly with the Windows clipboard, as I have tested:
      >
      > Non-BMP character pasted from / copied to the Windows clipboard
      > +------------------+---------------------------------+-------------------+
      > |                  | Windows XP with GB18030 support | Windows 98        |
      > +------------------+---------------------------------+-------------------+
      > | editing UTF-* or | before patch: bad               | before patch: bad |
      > | UCS-4* text      | after patch: OK                 | after patch: OK   |
      > +------------------+---------------------------------+-------------------+
      > | editing GB18030  | before patch: bad               | (cannot edit      |
      > | text             | after patch: OK                 | GB18030 text)     |
      > +------------------+---------------------------------+-------------------+
      > By the way, I think it would be better to rename the two functions
      > mentioned above to "utf16_to_utf8" and "utf8_to_utf16".
      >
      > Best regards,
      > Yanwei.
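
The paste direction described in the quoted report can be sketched roughly as follows. This is only an illustration of combining a surrogate pair before emitting UTF-8, not the code from the attached patch or from src/os_mswin.c; the function name utf16_to_utf8_sketch, its signature, and the choice to replace an unpaired surrogate with U+FFFD are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Vim's actual code): convert UTF-16 code units
 * to UTF-8, combining surrogate pairs into one code point first.
 * `out` must have room for at least 3 * inlen bytes.
 * Returns the number of bytes written. */
size_t utf16_to_utf8_sketch(const uint16_t *in, size_t inlen, uint8_t *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < inlen) {
        uint32_t cp = in[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF && i < inlen
                && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            /* High surrogate followed by low surrogate: one non-BMP code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i++] - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            /* Unpaired surrogate: substitute U+FFFD (a choice made for this sketch). */
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            out[n++] = (uint8_t)cp;
        } else if (cp < 0x800) {
            out[n++] = (uint8_t)(0xC0 | (cp >> 6));
            out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[n++] = (uint8_t)(0xE0 | (cp >> 12));
            out[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (uint8_t)(0xF0 | (cp >> 18));
            out[n++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

Given the two code units 0xD852 0xDCBB from the report, this should yield the four bytes 0xF0 0xA4 0xA2 0xBB rather than two three-byte sequences.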

      I expect this is related to the UTF-16le BOM problem you noticed this
      past Saturday. Maybe a combined patch would be OK, since in both cases
      the problem involves using UCS-2 (where surrogates are undefined)
      instead of UTF-16 (where surrogate pairs encode codepoints above the BMP)?
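
For reference, the UCS-2 versus UTF-16 difference in the copy direction comes down to a few lines of arithmetic. The sketch below is a hypothetical helper, not the patch under discussion and not code from src/os_mswin.c; the name utf16_encode_sketch and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Vim's actual code): encode one Unicode code
 * point as UTF-16. Writes 1 or 2 code units to `out` and returns that
 * count, or returns 0 if `cp` is a surrogate or lies above U+10FFFF. */
size_t utf16_encode_sketch(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* BMP code point: UCS-2 and UTF-16 agree, one unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Non-BMP: split the 20-bit offset into a high and a low surrogate.
     * A plain (uint16_t) cast here would instead drop the upper bits,
     * which is the truncation described in the report. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

For U+248BB this produces the pair 0xD852 0xDCBB, whereas casting 0x248BB directly to a 16-bit integer gives 0x48BB.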


      Best regards,
      Tony.
      --
      A public debt is a kind of anchor in the storm; but if the anchor be
      too heavy for the vessel, she will be sunk by that very weight which
      was intended for her preservation.
      -- Colton

      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_dev" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---