[BUG] Ambiguous-width character handling

  • Autrijus Tang
    Message 1 of 7 , Sep 18, 2002
      Greetings. I'm a happy user of VIM's multi-byte editing environment under the
      X terminal with 'mlterm', and combined they are pleasantly intelligent about
      Chinese odds and ends. My current settings are:

      set termencoding=big5
      set encoding=utf-8
      set fileencodings=big5,utf-8,big5-hkscs,gbk,euc-jp,euc-kr,utf-bom,iso8859-1

      However, I have run into a bug concerning "Ambiguous width characters" within
      Big5 files. For example, the Big5 character \xA2\x69, when converted to
      Unicode, maps to U+2588, whose UnicodeData.txt entry is:

      2588;FULL BLOCK;So;0;ON;;;;;N;;;;;

      This is an 'A' (ambiguous-width) character in EastAsianWidth.txt:

      2588;A # FULL BLOCK

      Whereas VIM correctly treats normal Chinese characters and punctuation as
      occupying two on-screen columns, it considers U+2588 a single-width
      character, which seriously disrupts the display.
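
As a minimal sketch of the classification described above, Python's standard `unicodedata` module can report the East Asian Width property of a character. The Big5-to-Unicode mapping for \xA2\x69 is taken from the message itself; the helper name `width_class` is hypothetical and only for illustration.

```python
import unicodedata

def width_class(c: str) -> str:
    """Return the East Asian Width property of a single character."""
    return unicodedata.east_asian_width(c)

# Big5 bytes quoted in the message; the message states they map to U+2588.
ch = b"\xa2\x69".decode("big5")
print(hex(ord(ch)), unicodedata.name(ch), width_class(ch))

# For comparison: a CJK ideograph is 'W' (wide), U+2588 is 'A' (ambiguous).
print(width_class("\u4e2d"), width_class("\u2588"))
```

A width function that treats 'W' as two columns but 'A' as one would reproduce the behaviour reported here.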

      To observe this bug, move the cursor to the end of the following line
      by pressing '$':

      ???1234???1234???1234???1234???1234???1234???1234???1234???1234???1234???1234

      You will notice that it stops at a point where there are still 9 characters
      displayed to its right. 'GVIM' also displays the line incorrectly, treating
      the full-width U+2588 as a single-width character with no corresponding font.
      Both behaviours are arguably erroneous.

      According to http://www.unicode.org/unicode/reports/tr11/#Ambiguous
      (Unicode Technical Report #11), the recommended way to handle ambiguous-width
      characters is:

      When mapping Unicode to legacy character encodings:
      * Ambiguous Unicode characters always map to full-width characters
      in East Asian legacy character encodings
      * Ambiguous Unicode characters always map to regular (narrow) characters
      in non-East Asian legacy character encodings

      When processing or displaying data:
      * Ambiguous characters behave like wide or narrow characters depending
      on context (language tag, script identification, associated font, source
      of data, or explicit markup; all can provide the context)

      Therefore, may I suggest a new buffer-local option, 'ambiguouswidth', which
      can have one of the following values:

      'h' denotes half-width (current) behaviour for ambiguous-width characters
      'f' denotes full-width behaviour for ambiguous-width characters
      'a' denotes automatic handling: full-width if either termencoding _or_
      fileencoding is one of [ uhc, johab, gbk, euc-cn, big5, big5-hkscs ],
      and half-width otherwise.
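
The three proposed values can be sketched as a small resolver. This is only an illustration of the rule as worded in the proposal, not anything implemented in Vim; the function name `resolve_ambiguous_width` and the constant `EAST_ASIAN_ENCODINGS` are hypothetical.

```python
# Encodings the proposal lists as implying full-width ambiguous characters.
EAST_ASIAN_ENCODINGS = {"uhc", "johab", "gbk", "euc-cn", "big5", "big5-hkscs"}

def resolve_ambiguous_width(setting: str, termencoding: str, fileencoding: str) -> int:
    """Return the column count (1 or 2) to use for ambiguous-width characters."""
    if setting == "h":      # half-width: one column
        return 1
    if setting == "f":      # full-width: two columns
        return 2
    if setting == "a":      # automatic: decide from either encoding
        encodings = {termencoding.lower(), fileencoding.lower()}
        return 2 if encodings & EAST_ASIAN_ENCODINGS else 1
    raise ValueError(f"unknown setting: {setting!r}")

# With the settings from this message (termencoding=big5), 'a' yields 2.
print(resolve_ambiguous_width("a", "big5", "utf-8"))
```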

      The 'a' option is good-to-have but not crucial to the operation. All I want
      is a way to override VIM's current behaviour.

      Not being very familiar with VIM's internals, and possessing less-than-competent
      C skills, I'd appreciate it if somebody could implement this idea, or at least
      point me to the relevant portion of the source so I can hack on it.

      Thanks,
      /Autrijus/
    • Autrijus Tang
      Message 2 of 7 , Sep 18, 2002
        On Thu, Sep 19, 2002 at 07:45:37AM +0800, Autrijus Tang wrote:
        > To observe this bug, move the cursor to the end of the following line
        > by pressing '$':
        >
        > ???1234???1234???1234???1234???1234???1234???1234???1234???1234???1234???1234

        Oops, make that:

        █1234█1234█1234█1234█1234█1234█1234█1234█1234█1234█1234

        > Therefore, may I suggest a new buffer-local option, 'ambiguouswidth', which
        > can have either of the following values:

        With a little digging I found this entry in :help todo:

        8 Some UTF-8 have an ambiguous width (single or double).
        Should inspect the font to find out what will be displayed. (Long)

        However, I'm a little perplexed as to how you would inspect the font under
        the console (with big5con, imcce or other multibyte VGA terminals), or within
        an X terminal. It seems to me that the "inspect the font to find out"
        approach only works in GVIM; please correct me if I'm mistaken.

        Thanks,
        /Autrijus/
      • Tony Mechelynck
        Message 3 of 7 , Sep 18, 2002
          ----- Original Message -----
          From: "Autrijus Tang" <autrijus@...>
          To: <vim-multibyte@...>
          Cc: <whiteg@...>
          Sent: Thursday, September 19, 2002 1:45 AM
          Subject: [BUG] Ambiguous-width character handling

          Greetings. I'm a happy user of VIM's multi-byte editing environment under the
          X terminal with 'mlterm', and combined they are pleasantly intelligent about
          Chinese odds and ends. My current settings are:

          set termencoding=big5
          set encoding=utf-8
          set fileencodings=big5,utf-8,big5-hkscs,gbk,euc-jp,euc-kr,utf-bom,iso8859-1

          [...]

          I don't feel competent to address most of the content of your message, but
          about the above:

          - AFAIK, vim doesn't know "utf-bom" as an encoding
          - You can use "ucs-bom" to be able to recognise a Byte Order Mark in
          your input files, but it must come before any other Unicode encoding,
          including "utf-8", else it will not work properly.

          see :help 'fileencodings'
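
One plausible reason the order matters can be sketched with a try-in-order detector. This is not Vim's actual detection logic: the function `detect` and its handling of the name "ucs-bom" are hypothetical stand-ins, and only the UTF-8 Byte Order Mark is checked.

```python
import codecs
from typing import List, Tuple

def detect(data: bytes, candidates: List[str]) -> Tuple[str, str]:
    """Try each candidate in order; return (name, text) for the first match."""
    for name in candidates:
        if name == "ucs-bom":
            # Hypothetical BOM check: matches only if a UTF-8 BOM is present.
            if data.startswith(codecs.BOM_UTF8):
                return name, data[len(codecs.BOM_UTF8):].decode("utf-8")
            continue
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

data = codecs.BOM_UTF8 + "abc".encode("utf-8")

# "utf-8" listed first: it decodes successfully, so the BOM check never runs
# and the mark stays in the text as U+FEFF.
print(detect(data, ["utf-8", "ucs-bom"]))

# "ucs-bom" listed first: the mark is recognised and stripped.
print(detect(data, ["ucs-bom", "utf-8"]))
```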

          Regards,
          Tony.
        • Bram Moolenaar
          Message 4 of 7 , Sep 19, 2002
            Autrijus Tang wrote:

            [about ambiguous characters being single width while the terminal
            displays them as double width]

            Note that Vim only supports one font for the whole Vim window. I don't
            expect that a single font has two glyphs for the same character,
            depending on the context. Therefore the choice for whether an ambiguous
            character is single or double width should match the font.

            If we can't obtain the info from the font, an option could be used.

            > With a little digging I found this entry in the :help todo entry:
            >
            > 8 Some UTF-8 have an ambiguous width (single or double).
            > Should inspect the font to find out what will be displayed. (Long)
            >
            > However I'm a little perplexed in how would you inspect the font under the
            > console (with big5con, imcce or other multibyte vga terminals), or within
            > the X terminal? It seems to me that the "inspect the font to find out"
            > way only works in GVIM; please correct me if I'm mistaken.

            How does the terminal know how wide a character is? I suppose any Asian
            terminal emulator will prefer double-width characters for most
            characters.

            There are several alternatives:
            1. Add an option that specifies the width of all ambiguous characters:
            single or double width.
            2. Try obtaining the width from the font. Won't work for terminal
            emulators.
            3. A combination: use the option to specify single, double or auto,
            where auto works like the second alternative.

            A user might set the option in his .vimrc based on the name of the
            terminal:

            if &term =~ "big5"
            set ambiwidth=double
            else
            set ambiwidth=single
            endif
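
The effect of the first alternative (a single/double setting) on the sample line from earlier in the thread can be sketched as follows. This is a simplified model, not Vim's width function: the name `display_width` and its `ambiguous` parameter are hypothetical, and combining or control characters are not handled.

```python
import unicodedata

def display_width(text: str, ambiguous: int = 1) -> int:
    """Sum column widths: W/F take 2, A takes `ambiguous`, everything else 1."""
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2
        elif eaw == "A":
            total += ambiguous
        else:
            total += 1
    return total

# The sample line: U+2588 followed by "1234", repeated 11 times.
line = "\u25881234" * 11
print(display_width(line, ambiguous=1))  # 55
print(display_width(line, ambiguous=2))  # 66
```

The two settings disagree by one column per ambiguous character, so an editor and a terminal that make different choices drift further apart as the line goes on.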

            --
            The goal of science is to build better mousetraps.
            The goal of nature is to build better mice.

            /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
            /// Creator of Vim - Vi IMproved -- http://www.vim.org \\\
            \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
            \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
          • Noah Levitt
            Message 5 of 7 , Sep 19, 2002
              On Thu, Sep 19, 2002 at 21:43:00 +0200, Bram Moolenaar wrote:
              >
              > Note that Vim only supports one font for the whole Vim window. I don't
              > expect that a single font has two glyphs for the same character,
              > depending on the context. Therefore the choice for whether an ambiguous
              > character is single or double width should match the font.

              What about guifont and guifontwide? The fonts I use have
              some overlap.

              Incidentally, Autrijus's sample line gave me no problems in
              a UTF-8 xterm. It treated the characters as single-width.
              xterm uses Markus Kuhn's wcwidth, I believe.

              Noah
            • Bram Moolenaar
              Message 6 of 7 , Sep 19, 2002
                Noah Levitt wrote:

                > On Thu, Sep 19, 2002 at 21:43:00 +0200, Bram Moolenaar wrote:
                > >
                > > Note that Vim only supports one font for the whole Vim window. I don't
                > > expect that a single font has two glyphs for the same character,
                > > depending on the context. Therefore the choice for whether an ambiguous
                > > character is single or double width should match the font.
                >
                > What about guifont and guifontwide? The fonts I use have
                > some overlap.

                That has a chicken-egg problem: the choice between the two fonts is made
                based on the width of a character. There could be a test if a glyph for
                a character is available, but that's complicated.

                > Incidentally, Autrijus's sample line gave me no problems in
                > an utf8 xterm. It treated the characters as single-width.
                > xterm uses Markus Kuhn's wcwidth, I believe.

                The function I use has the same source, thus it's no surprise Vim and
                Xterm work well together. The problem probably only exists on Asian
                terminals.

                --
                The Feynman problem solving Algorithm:
                1) Write down the problem
                2) Think real hard
                3) Write down the answer

              • Autrijus Tang
                Message 7 of 7 , Sep 19, 2002
                  On Thu, Sep 19, 2002 at 10:37:04PM +0200, Bram Moolenaar wrote:
                  > Noah Levitt wrote:
                  > > On Thu, Sep 19, 2002 at 21:43:00 +0200, Bram Moolenaar wrote:
                  > > > Note that Vim only supports one font for the whole Vim window. I don't
                  > > > expect that a single font has two glyphs for the same character,
                  > > > depending on the context. Therefore the choice for whether an ambiguous
                  > > > character is single or double width should match the font.
                  > > What about guifont and guifontwide? The fonts I use have
                  > > some overlap.
                  > That has a chicken-egg problem: the choice between the two fonts is made
                  > based on the width of a character. There could be a test if a glyph for
                  > a character is available, but that's complicated.

                  Yes, and I'd think that it is somehow "too smart" for Vim to probe
                  the glyph's width -- what if the guifont has the narrow glyph, and
                  guifontwide has the East Asian-fullwidth glyph?

                  A separate option, maybe set initially by some heuristic
                  (though that is not necessary), is IMHO more natural.

                  > > Incidentally, Autrijus's sample line gave me no problems in
                  > > an utf8 xterm. It treated the characters as single-width.
                  > > xterm uses Markus Kuhn's wcwidth, I believe.
                  > The function I use has the same source, thus it's no surprise Vim and
                  > Xterm work well together. The problem probably only exists on Asian
                  > terminals.

                  Or rather, it only exists with East Asian fonts, like the one I use here:
                  "ar pl mingti2l big5-iso10646-1;"

                  Switching to other fonts can surely match the single-width results
                  given by wcwidth, but since text files prepared in other Unicode/Big5
                  editors assume a double-width layout, the resulting formatting
                  and display will be incorrect from the author's perspective.

                  Thanks,
                  /Autrijus/