
Vim multibyte support for non-utf8 encodings

  • Hye-Shik Chang
    Message 1 of 3 , Jul 15, 2005
      Hi,

      The current version of Vim doesn't handle non-UTF-8 multibyte encodings
      such as EUC and/or GBK on FreeBSD. The cursor lands in odd places
      inside a character, and the last character on a line sometimes
      disappears.

      The problem is that Vim depends on undefined behavior of
      mblen(3). Looking at Vim's source code, mbyte.c:653, the routine assumes
      that mblen(3) isn't stateful. In glibc and Solaris libc, mblen(3)
      does not change its internal state when EILSEQ or EINVAL occurs,
      but FreeBSD libc changes the internal state even when it hits an
      error. This behavior of mblen(3) is undefined in POSIX [1], so none
      of the libc implementations is wrong. I therefore think the multibyte
      state needs to be reset before the mblen(3) call so that the routine
      works independently of the implementation.
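
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual Vim code: the function name guess_char_len and its NUL handling are assumptions made for the example, and only mblen() itself comes from the C library. The probe is preceded by a reset call so that a failed earlier probe cannot leave stale shift state behind.

```c
#include <stdlib.h>   /* mblen */

/*
 * Hypothetical helper (not taken from Vim): guess whether a character
 * starting with byte `lead` occupies 1 or 2 bytes in the current locale.
 */
int guess_char_len(unsigned char lead)
{
    char buf[2];

    /* Assumption for this sketch: treat NUL as a single-byte character. */
    if (lead == 0)
        return 1;

    buf[0] = (char)lead;
    buf[1] = '\0';

    /* Reset any state left behind by an earlier failed call. */
    (void)mblen(NULL, 0);

    /*
     * Only one byte is offered. A result <= 0 covers both -1 (invalid or
     * incomplete sequence) and platforms that return 0 in that case.
     */
    if (mblen(buf, (size_t)1) <= 0)
        return 2;
    return 1;
}
```

Without the reset, a libc that updates its internal state on error could misreport the next byte that is probed, which would match the cursor symptoms described above.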

      My patch is attached.

      [1] http://www.opengroup.org/onlinepubs/009695399/functions/mblen.html

      Hye-Shik
    • Bram Moolenaar
      Message 2 of 3 , Jul 16, 2005
        Hye-Shik Chang wrote:

        > The current version of Vim doesn't handle non-UTF-8 multibyte encodings
        > such as EUC and/or GBK on FreeBSD. The cursor lands in odd places
        > inside a character, and the last character on a line sometimes
        > disappears.
        >
        > The problem is that Vim depends on undefined behavior of
        > mblen(3). Looking at Vim's source code, mbyte.c:653, the routine assumes
        > that mblen(3) isn't stateful. In glibc and Solaris libc, mblen(3)
        > does not change its internal state when EILSEQ or EINVAL occurs,
        > but FreeBSD libc changes the internal state even when it hits an
        > error. This behavior of mblen(3) is undefined in POSIX [1], so none
        > of the libc implementations is wrong. I therefore think the multibyte
        > state needs to be reset before the mblen(3) call so that the routine
        > works independently of the implementation.
        >
        > My patch is attached.
        >
        > [1] http://www.opengroup.org/onlinepubs/009695399/functions/mblen.html

        > --- mbyte.c.orig Fri Apr 23 17:44:36 2004
        > +++ mbyte.c Thu May 12 08:48:35 2005
        > @@ -650,6 +650,7 @@
        >   * where mblen() returns 0 for invalid character.
        >   * Therefore, following condition includes 0.
        >   */
        > + (void)mblen(NULL, 0);
        >   if (mblen(buf, (size_t)1) <= 0)
        >       n = 2;
        >   else

        The behavior of mblen() on various systems has always been a bit unclear
        to me. Your remark makes a lot of sense, but I wonder why nobody had
        this problem before.

        I'll include this now in Vim 7 and await further comments. Hopefully
        there is no mblen() implementation that crashes when invoked with a NULL
        pointer.

        --
        CUSTOMER: Well, can you hang around a couple of minutes? He won't be
        long.
        MORTICIAN: Naaah, I got to go on to Robinson's -- they've lost nine today.
        CUSTOMER: Well, when is your next round?
        MORTICIAN: Thursday.
        DEAD PERSON: I think I'll go for a walk.
        The Quest for the Holy Grail (Monty Python)

        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
        /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
        \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
      • Hye-Shik Chang
        Message 3 of 3 , Jul 16, 2005
          On Sat, Jul 16, 2005 at 12:44:34PM +0200, Bram Moolenaar wrote:
          >
          > Hye-Shik Chang wrote:
          >
          > > The current version of Vim doesn't handle non-UTF-8 multibyte encodings
          > > such as EUC and/or GBK on FreeBSD. The cursor lands in odd places
          > > inside a character, and the last character on a line sometimes
          > > disappears.
          [snip]
          >
          > The behavior of mblen() on various systems has always been a bit unclear
          > to me. Your remark makes a lot of sense, but I wonder why nobody had
          > this problem before.
          >

          In fact, many Japanese FreeBSD users seem to have suffered
          from the problem:

          http://www.queen.ne.jp/iMA/showmdir.pl?ports-jp=Current&num=14694&link=20040430015955%2eGA52106%25st%40be%2eto
          (Even if you can't read Japanese, you can still pick out some
          alphabetic text on the page. :)

          I wasn't aware of the problem because I use a UTF-8 locale, but
          a few friends of mine asked me for help.

          > I'll include this now in Vim 7 and await further comments. Hopefully
          > there is no mblen() implementation that crashes when invoked with a NULL
          > pointer.

          Thanks for applying the fix! I think the fix will not harm any
          platform. mblen(NULL, 0); is clearly defined in POSIX as a reset
          method.
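
As a small illustration of the reset call mentioned above: per the C and POSIX descriptions of mblen(), a call with a null pointer puts the function back into its initial state and returns nonzero only when the current encoding is state-dependent. The wrapper name reset_mblen_state is hypothetical and exists only for this sketch.

```c
#include <stdlib.h>   /* mblen */

/*
 * Hypothetical helper: return mblen() to its initial shift state and
 * report whether the current locale's encoding is state-dependent.
 * Returns 1 for state-dependent encodings, 0 otherwise.
 */
int reset_mblen_state(void)
{
    /*
     * With a null pointer, mblen() does not examine any bytes; it resets
     * its internal state and returns nonzero only if the encoding uses
     * state-dependent (shift) sequences.
     */
    return mblen(NULL, 0) != 0;
}
```

Because no bytes are examined in this form of the call, the null pointer is never dereferenced by a conforming implementation.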


          Thanks,
          Hye-Shik