Gvim for Windows doesn't handle non-BMP characters when interchanging data with Windows OS

  • JiaYanwei
    Message 1 of 7 , Oct 22, 2008
      When exchanging data with Windows, for example through the clipboard, gvim
      converts the text to UCS-2. Unlike UTF-16, however, UCS-2 cannot encode
      non-BMP characters.

      For example, pasting the non-BMP character U+248BB from the Windows clipboard
      inserts two separate characters, <d852> and <dcbb>. The cause is the function
      ucs2_to_utf8() in src/os_mswin.c, which treats the two halves of a surrogate
      pair as separate Unicode characters and converts them into the invalid UTF-8
      sequence 0xED 0xA1 0x92 0xED 0xB2 0xBB. The correct UTF-8 sequence is
      0xF0 0xA4 0xA2 0xBB.
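
      The arithmetic behind these byte sequences can be sketched in C as follows.
      This is only an illustration, not the attached patch and not Vim's code: the
      helper names surrogates_to_codepoint() and codepoint_to_utf8() are
      hypothetical, and the encoder is assumed to have room for four output bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from Vim): combine a UTF-16 surrogate pair into
 * one code point. Returns 0 if hi/lo are not a valid high/low pair. */
uint32_t surrogates_to_codepoint(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Hypothetical helper: encode c as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if c is above U+10FFFF.
 * It deliberately does not reject surrogate values, so feeding it 0xD852
 * and 0xDCBB one at a time reproduces the bad sequence from the report. */
size_t codepoint_to_utf8(uint32_t c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

      Combining 0xD852 and 0xDCBB first gives 0x248BB, which encodes to the four
      bytes F0 A4 A2 BB. Encoding each code unit on its own gives ED A1 92 and
      ED B2 BB instead.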

      Similarly, copying the non-BMP character U+248BB to the Windows clipboard
      leaves U+48BB in the clipboard, because the function utf8_to_ucs2() in
      src/os_mswin.c casts the integer 0x248BB to a short integer, 0x48BB.
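
      The truncation, and the surrogate-pair encoding that avoids it, can be
      sketched in C as follows. Again this is only an illustration, not the
      attached patch: truncate_to_ucs2() and codepoint_to_utf16() are hypothetical
      names, and the output buffer is assumed to hold two 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: what a plain cast to a 16-bit unit does.
 * The bits above bit 15 are silently dropped. */
uint16_t truncate_to_ucs2(uint32_t c)
{
    return (uint16_t)c;
}

/* Hypothetical helper: write code point c as UTF-16 into out (room for
 * 2 units). Returns the number of units written (1 or 2), or 0 if c is a
 * surrogate value or lies above U+10FFFF. */
size_t codepoint_to_utf16(uint32_t c, uint16_t *out)
{
    assert(out != NULL);
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;               /* surrogates are not characters */
    if (c < 0x10000) {
        out[0] = (uint16_t)c;   /* BMP: a single code unit */
        return 1;
    }
    if (c > 0x10FFFF)
        return 0;
    c -= 0x10000;               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (c >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (c & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

      For 0x248BB the plain cast yields 0x48BB, while the pair encoding yields
      the two units 0xD852 0xDCBB.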

      The attachment is a patch that adds surrogate pair handling to the two
      functions mentioned above. With it, non-BMP characters are exchanged
      correctly with the Windows clipboard, as I have tested:
      Non-BMP characters pasted from / copied to the Windows clipboard:
      +----------+--------------------------------+------------------------+
      |          | WindowsXP with GB18030 support |  Windows 98            |
      +----------+--------------------------------+------------------------+
      | editing  | before patch: broken           | before patch: broken   |
      | UTF-* or | after patch: works             | after patch: works     |
      | UCS-4*   |                                |                        |
      | text     |                                |                        |
      +----------+--------------------------------+------------------------+
      | editing  | before patch: broken           | (cannot edit           |
      | GB18030  | after patch: works             |  GB18030 text)         |
      | text     |                                |                        |
      +----------+--------------------------------+------------------------+
      By the way, I think it would be better to rename the two functions mentioned
      above to "utf16_to_utf8" and "utf8_to_utf16".

      Best regards,
      Yanwei.
      -
      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_dev" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---

    • Tony Mechelynck
      Message 2 of 7 , Oct 22, 2008
        On 22/10/08 15:55, JiaYanwei wrote:
        > When interchanging data with Windows such as clipboard operation, gvim will
        > convert the text into UCS-2 encoding, but different from UTF-16, UCS-2
        > can't
        > encode non-BMP characters.
        >
        > For example, when paste a non-BMP character U+248BB from Windows clipboard,
        > it will insert two separated characters <d852> <dcbb>. It is caused by the
        > function ucs2_to_utf8() in src/os_mswin.c, which treates the surrogate
        > pairs
        > as separated unicode characters, and convert it into bad UTF-8 sequence
        > 0xED 0xA1 0x92 0xED 0xB2 0xBB -- the correct UTF-8 sequence should be
        > 0xF0 0xA4 0xA2 0xBB.
        >
        > Similarly, when copy a non-BMP character U+248BB into Windows clipboard,
        > the
        > content of clipboard will be U+48BB, because the function utf8_to_ucs2()
        > in src/os_mswin.c will cast the integer 0x248BB into a short integer 0x48BB.
        >
        > The attachment is a patch. The surrogate pairs handling has been add
        > into the
        > two functions mentioned above. This make the non-BMP characters can be
        > correctly interchanged with Windows clipboard as I had tested:
        > Non-BMP character paste from/copy into Windows clipboard
        > +----------+--------------------------------+------------------------+
        > | | WindowsXP with GB18030 support | Windows 98 |
        > +----------+--------------------------------+------------------------+
        > | editing | before patch works bad | before patch works bad |
        > | UTF-* or | after patch works OK | after patch works OK |
        > | UCS-4* | | |
        > | text | | |
        > +----------+--------------------------------+------------------------+
        > | editing | before patch works bad | ( can not edit |
        > | GB18030 | after patch works OK | GB18030 text ) |
        > | text | | |
        > +----------+--------------------------------+------------------------+
        > B.T.W.: It seems better to replace the functions name mentioned above with
        > "utf16_to_utf8" and "utf8_to_utf16", I think.
        >
        > Best regards,
        > Yanwei.

        I expect this is related to the UTF-16le BOM problem you noticed this
        past Saturday. Maybe a combined patch would be OK, since in both cases
        the problem involves using UCS-2 (where surrogates are undefined)
        instead of UTF-16 (where surrogate pairs encode codepoints above the BMP)?


        Best regards,
        Tony.
        --
        A public debt is a kind of anchor in the storm; but if the anchor be
        too heavy for the vessel, she will be sunk by that very weight which
        was intended for her preservation.
        -- Colton

      • JiaYanwei
        Message 3 of 7 , Oct 22, 2008
          Hello Tony,

          It's really to be the similar problem, but this one only arises under the
          Windows operating system, whereas the UTF-16le BOM problem is platform
          independent. I was uncertain whether a combined patch would be convenient.

          On 2008-10-22 23:21:11, Tony Mechelynck wrote:
          > I expect this is related with the UTF-16le BOM problem you noticed this
          > past Saturday. Maybe a combined patch would be OK, since in both cases,
          > the problem involves using UCS-2 (where surrogates are undefined)
          > instead of UTF-16 (where surrogate pairs encode codepoints above the BMP)? 

          Best regards,
          Yanwei

        • JiaYanwei
          Message 4 of 7 , Oct 22, 2008
            Oh, I made a mistake. I meant to say "They're really similar problems"
            in the first sentence.

            On 2008-10-23 00:16:20, JiaYanwei
            > Hello Tony,
            >
            > It's really to be the similar problem, but this one only arise under Windows
            > operating system, the UTF-16le BOM problem is platform independence. I was 
            > uncertain wherher a combined patch was convenient.

          • Tony Mechelynck
            Message 5 of 7 , Oct 22, 2008
              On 22/10/08 18:25, JiaYanwei wrote:
              > Oh, I had made a mistake, I want to say "They're really similar
              > problems" the first sentence.
              >
              > On 2008-10-23 00:16:20, JiaYanwei
              > > Hello Tony,
              > >
              > > It's really to be the similar problem, but this one only arise under
              > > Windows operating system, the UTF-16le BOM problem is platform
              > > independence. I was uncertain wherher a combined patch was convenient.

              Actually, to most Windows programs "Unicode" means "UTF-16le with
              BOM"; however, they can usually read (but not write) other Unicode
              encodings if a BOM is present. So IIUC it's another aspect of the
              same problem.

              However, Bram customarily keeps separate patches for "Unix"
              (including common), "extra" (including Windows and VMS) and "language"
              sources, so these two patches should perhaps be kept separate after all.


              Best regards,
              Tony.
              --
              If all the world's economists were laid end to end, we wouldn't reach a
              conclusion.
              -- William Baumol

            • Bram Moolenaar
              Message 6 of 7 , Nov 2, 2008
                Yanwei wrote:

                > When interchanging data with Windows such as clipboard operation, gvim will
                > convert the text into UCS-2 encoding, but different from UTF-16, UCS-2 can't
                > encode non-BMP characters.
                >
                > For example, when paste a non-BMP character U+248BB from Windows clipboard,
                > it will insert two separated characters <d852> <dcbb>. It is caused by the
                > function ucs2_to_utf8() in src/os_mswin.c, which treates the surrogate pairs
                > as separated unicode characters, and convert it into bad UTF-8 sequence
                > 0xED 0xA1 0x92 0xED 0xB2 0xBB -- the correct UTF-8 sequence should be
                > 0xF0 0xA4 0xA2 0xBB.
                >
                > Similarly, when copy a non-BMP character U+248BB into Windows clipboard, the
                > content of clipboard will be U+48BB, because the function utf8_to_ucs2()
                > in src/os_mswin.c will cast the integer 0x248BB into a short integer 0x48BB.
                >
                > The attachment is a patch. The surrogate pairs handling has been add into the
                > two functions mentioned above. This make the non-BMP characters can be
                > correctly interchanged with Windows clipboard as I had tested:
                > Non-BMP character paste from/copy into Windows clipboard
                > +----------+--------------------------------+------------------------+
                > | | WindowsXP with GB18030 support | Windows 98 |
                > +----------+--------------------------------+------------------------+
                > | editing | before patch works bad | before patch works bad |
                > | UTF-* or | after patch works OK | after patch works OK |
                > | UCS-4* | | |
                > | text | | |
                > +----------+--------------------------------+------------------------+
                > | editing | before patch works bad | ( can not edit |
                > | GB18030 | after patch works OK | GB18030 text ) |
                > | text | | |
                > +----------+--------------------------------+------------------------+
                > B.T.W.: It seems better to replace the functions name mentioned above with
                > "utf16_to_utf8" and "utf8_to_utf16", I think.

                Looks good, thanks. I'll include it later.

                --
                hundred-and-one symptoms of being an internet addict:
                166. You have been on your computer soo long that you didn't realize
                you had grandchildren.

                /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                \\\ download, build and distribute -- http://www.A-A-P.org ///
                \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

              • JiaYanwei
                Message 7 of 7 , Nov 2, 2008
                  It's a pleasure for me. :-)

                  On 2008-11-02 21:56:20, Bram Moolenaar wrote:
                  
                  > >Looks good, thanks.  I'll include it later.
                  Best regards,
                  Yanwei.
