Re: Multibyte bugs

  • Bram Moolenaar
    Message 1 of 8 , Apr 10, 2010
      Tony Mechelynck wrote:

      > 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
      > GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
      > doesn't give the expected result: instead of U+00A0 ("Alt-space", the
      > non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
      > correctly.

      Why do you expect CTRL-K <space> <space> to produce 0xa0? According to
      http://www.faqs.org/rfcs/rfc1345.html it's 0xe000.

      > 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?

      Why would it be a double-width character? In
      http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
      "private use".
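As a rough cross-check of the width question, Python's standard `unicodedata` module exposes the same East Asian Width property that EastAsianWidth.txt defines. This is only a sketch: the answer depends on the Unicode version bundled with the interpreter, and it says nothing about the glyph a particular font happens to draw. The helper name `is_double_width` is made up for this illustration.

```python
import unicodedata

def is_double_width(ch: str) -> bool:
    """True if the East Asian Width property is Wide (W) or Fullwidth (F)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")

pua = "\ue000"  # first code point of the Private Use Area
cjk = "\u4e00"  # a CJK unified ideograph, for comparison

# Print the general category and whether the property calls for two cells.
print(unicodedata.category(pua), is_double_width(pua))
print(unicodedata.category(cjk), is_double_width(cjk))
```

A private-use code point has general category `Co` and is not classified as Wide or Fullwidth, whereas a real CJK ideograph such as U+4E00 is.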

      > 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints. I
      > tried to find examples and counterexamples, as follows (in the comment
      > after the :echo statements, the UTF-8 expansion in hex):
      >
      > :echo "«\<Char-0x40>»" | " 40
      > «@»
      > :echo "«\<Char-0x80>»" | " C2 80
      > «<80><fe>X»
      > :echo "«\<Char-0x100>»" | " C4 80
      > «Ā<fe>X»
      > :echo "«\<Char-0x101>»" | " C4 81
      > «ā»
      > :echo "«\<Char-0x180>»" | " C6 80
      > «ƀ<fe>X»
      > :echo "«\<Char-0x190>»" | " C6 90
      > «Ɛ»
      > :echo "«\<Char-0x1A0>»" | " C6 A0
      > «Ơ»
      > :echo "«\<Char-0x1C0>»" | " C7 80
      > «ǀ<fe>X»
      > :echo "«\<Char-0x4E00>»" | " E4 B8 80
      > «一<fe>X»
      > :echo "«\<Char-0x4E01>»" | " E4 B8 81
      > «丁»
      > :echo "«\<Char-0x4E20>»" | " E4 B8 A0
      > «丠»
      > :echo "«\<Char-0x4E40>»" | " E4 B9 80
      > «乀<fe>X»
      > :echo "«\<Char-0xE000>»" | " EE 80 80
      > «<ee><80><fe>X<80><fe>X»
      > :echo "«\<Char-57344>»" | " EE 80 80
      > «<ee><80><fe>X<80><fe>X»
      > :echo "«\<Char-0xE001>»" | " EE 80 81
      > «<ee><80><fe>X<81>»"
      > :echo "«\<Char-0xE040>»" | " EE 81 80
      > Â«î €<fe>X»
      >
      > This seems to indicate that the extra bytes 0xFE 0x58 appear after any
      > 0x80 in the UTF-8 expansion of the character. (I added the « »
      > characters to "bound" the display so that any extra whitespace would be
      > visible but they change nothing to the bug.)

      The form "\<xxx>" is for special keys, not characters. For the character
      itself use \x or \u or \U. See ":help expr-string".
      The special keys are escaped for use in a mapping.

      > The bug does not occur after Ctrl-V u in Insert mode or when using
      > <Char-...> in an Insert-mode mapping. It does when using "\<Char-...>"
      > in other commands than :echo. Note the following:
      >
      > :let j = "\<Char-0xE000>"
      > :let j
      > j <ee><80><fe>X<80><fe>X
      > i<Ctrl-R>=j<Enter>
      > î<t_þ>X<t_þ>X
      >
      > (where <Ctrl-R> and <Enter> are one keystroke each, not counting
      > modifiers). Apparently gvim tries to interpret 0x80 0xFE as a "special
      > key", and "resolves" it (incorrectly) as <t_þ>.
      >
      > Two very big files were loaded when I first noticed bug #3, but
      > restarting gvim without them reproduced the bug again with the same
      > spurious bytes.

      --
      SUPERIMPOSE "England AD 787". After a few more seconds we hear hoofbeats in
      the distance. They come slowly closer. Then out of the mist comes KING
      ARTHUR followed by a SERVANT who is banging two half coconuts together.
      "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ download, build and distribute -- http://www.A-A-P.org ///
      \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

      --
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php

      To unsubscribe, reply using "remove me" as the subject.
    • Tony Mechelynck
      Message 2 of 8 , Apr 10, 2010
        On 10/04/10 23:43, Bram Moolenaar wrote:
        >
        > Tony Mechelynck wrote:
        >
        >> 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
        >> GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
        >> doesn't give the expected result: instead of U+00A0 ("Alt-space", the
        >> non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
        >> correctly.
        >
        > Why do you expect CTRL-K <space> <space> to produce 0xa0? According to
        > http://www.faqs.org/rfcs/rfc1345.html it's 0xe000.

        Because of the following paragraph at lines 99-100 of digraph.txt:

        ----8<----
        > For CTRL-K, there is one general digraph: CTRL-K <Space> {char} will enter
        > {char} with the highest bit set. You can use this to enter meta-characters.
        ---->8----

        When {char} is 0x20 i.e. <Space>, the above tells me that CTRL-K <Space>
        <Space> gives 0xA0 i.e. the non-breaking space, which is useful to enter
        the "meta-character" Meta-Space if I don't remember the NS digraph. If
        U+E000 is a "private use" character, I don't see why it needs a digraph
        of its own anyway.
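The arithmetic behind that reading of digraph.txt can be sketched in a few lines of Python. This only illustrates "{char} with the highest bit set" for a 7-bit character; it is not how Vim implements CTRL-K, and the function name `with_high_bit` is invented for the example.

```python
def with_high_bit(ch: str) -> int:
    """Return the code of a 7-bit character with bit 7 (0x80) set."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("expected a 7-bit (ASCII) character")
    return code | 0x80

# <Space> is 0x20, so setting the highest bit gives 0xA0 (no-break space).
print(hex(with_high_bit(" ")))
```

Under this reading, CTRL-K <Space> <Space> would be expected to give U+00A0 rather than U+E000.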

        On reading that RFC, which states in its beginning paragraph that it has
        no normative value whatsoever, I see (at the very end of section 3)
        quite a number of digraphs and trigraphs assigned to U+E000 to U+E028,
        in what Unicode calls a "private use area": see for instance the very
        start of http://www.unicode.org/charts/pdf/UE000.pdf:

        ----8<----
        Private Use Area
        Range: E000–F8FF
        The Private Use Area does not contain any character assignments,
        consequently no character code charts or namelists are provided for this
        area.
        ---->8----

        At least some of the characters listed there in the RFC have a different
        Unicode codepoint assigned to them, but maybe Unicode assigned them
        after the RFC (dated June 1992) was published. Personally I have strong
        doubts as to the usefulness of any Vim digraph for a "private use"
        character. U+E000 is listed as "indicates unfinished (Mnemonic)". I'm
        not sure what that means, unless maybe that a blank space in a charset
        chart (further down in the same RFC) indicates that the chart is unfinished?

        >
        >> 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?
        >
        > Why would it be a double-width character? In
        > http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
        > "private use".

        Ah, I see. FWIW my usual 'guifont' has a glyph for it, which AFAICT is a
        fullwidth CJK glyph. OTOH the Unihan database does not mention it.

        >
        >> 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints.
        [...]
        >
        > The form "\<xxx>" is for special keys, not characters. For the character
        > itself use \x or \u or \U. See ":help expr-string".
        > The special keys are escaped for use in a mapping.

        The example given at |expr-string| is "\<C-W>" which is the "<control>"
        character defined by ASCII as 0x17 ("\x17") and by Unicode as U+0017
        ("\u0017"), not a "special" non-ASCII key like <F8>, <Home> or
        <PageDown>. I had always thought that _every_ <> name could be used in a
        double-quoted string with a backslash prefix, and indeed I have verified
        that it works for all the <Char-nnnn> or <Char-0xnnnn> that I tested
        _except_ those whose UTF-8 expansion includes either or both of the
        bytes 0x80 and 0x9B, in which case two spurious bytes are inserted
        immediately after every occurrence of a 0x80 or 0x9B byte.

        If this bug is WONTFIX, I suggest mentioning explicitly at the bottom of
        the list under |expr-quote| that the \<xxx> form does not apply if xxx
        is Char-nnnn or Char-0xnnnn.
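The pattern in the :echo listing from the first message can be reproduced with a small Python model. It is a sketch of the observed output only, assuming the two extra bytes are always 0xFE 0x58 and always follow a 0x80 byte; it does not model the 0x9B case, because the thread does not show which bytes appear there. The names `TRIGGER`, `FILLER` and `observed_expansion` are hypothetical.

```python
TRIGGER = 0x80                # byte after which the extra output appears
FILLER = bytes([0xFE, 0x58])  # the two bytes shown as "<fe>X" in the listing

def observed_expansion(codepoint: int) -> bytes:
    """UTF-8 encode a code point, then append FILLER after every 0x80 byte."""
    out = bytearray()
    for b in chr(codepoint).encode("utf-8"):
        out.append(b)
        if b == TRIGGER:
            out += FILLER
    return bytes(out)

# Code points taken from the examples quoted in this thread.
for cp in (0x40, 0x101, 0x4E00, 0xE000, 0xE001):
    print(f"U+{cp:04X}", observed_expansion(cp).hex(" "))
```

Code points whose UTF-8 form contains no 0x80 byte (such as U+0101, C4 81) come out unchanged, which matches the cases that displayed correctly.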


        Best regards,
        Tony.
        --
        "To whoever finds this note -
        I have been imprisoned by my father who wishes me to marry
        against my will. Please please please please come and rescue me.
        I am in the tall tower of Swamp Castle."
        SIR LAUNCELOT's eyes light up with holy inspiration.
        "Monty Python and the Holy Grail" PYTHON (MONTY)
        PICTURES LTD

      • Bram Moolenaar
        Message 3 of 8 , Apr 11, 2010
          Tony Mechelynck wrote:

          > >> 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
          > >> GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
          > >> doesn't give the expected result: instead of U+00A0 ("Alt-space", the
          > >> non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
          > >> correctly.
          > >
          > > Why do you expect CTRL-K <space> <space> to produce 0xa0? According to
          > > http://www.faqs.org/rfcs/rfc1345.html it's 0xe000.
          >
          > Because of the following paragraph at lines 99-100 of digraph.txt:
          >
          > ----8<----
          > > For CTRL-K, there is one general digraph: CTRL-K <Space> {char} will enter
          > > {char} with the highest bit set. You can use this to enter meta-characters.
          > ---->8----
          >
          > When {char} is 0x20 i.e. <Space>, the above tells me that CTRL-K <Space>
          > <Space> gives 0xA0 i.e. the non-breaking space, which is useful to enter
          > the "meta-character" Meta-Space if I don't remember the NS digraph. If
          > U+E000 is a "private use" character, I don't see why it needs a digraph
          > of its own anyway.

          Ah, OK.

          > On reading that RFC, which states in its beginning paragraph that it has
          > no normative value whatsoever, I see (at the very end of section 3)
          > quite a number of digraphs and trigraphs assigned to U+E000 to U+E028,
          > in what Unicode calls a "private use area": see for instance the very
          > start of http://www.unicode.org/charts/pdf/UE000.pdf:
          >
          > ----8<----
          > Private Use Area
          > Range: E000–F8FF
          > The Private Use Area does not contain any character assignments,
          > consequently no character code charts or namelists are provided for this
          > area.
          > ---->8----
          >
          > At least some of the characters listed there in the RFC have a different
          > Unicode codepoint assigned to them, but maybe Unicode assigned them
          > after the RFC (dated June 1992) was published. Personally I have strong
          > doubts as to the usefulness of any Vim digraph for a "private use"
          > character. U+E000 is listed as "indicates unfinished (Mnemonic)". I'm
          > not sure what that means, unless maybe that a blank space in a charset
          > chart (further down in the same RFC) indicates that the chart is unfinished?

          It's weird that digraphs are defined for an area that doesn't have
          characters assigned to it. I wonder what happened here. Perhaps this
          changed at some point in time? If we know the reason we may want to
          drop all the digraphs for 0xexxx.


          > >> 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?
          > >
          > > Why would it be a double-width character? In
          > > http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
          > > "private use".
          >
          > Ah, I see. FWIW my usual 'guifont' has a glyph for it, which AFAICT is a
          > fullwidth CJK glyph. OTOH the Unihan database does not mention it.
          >
          > >
          > >> 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints.
          > [...]
          > >
          > > The form "\<xxx>" is for special keys, not characters. For the character
          > > itself use \x or \u or \U. See ":help expr-string".
          > > The special keys are escaped for use in a mapping.
          >
          > The example given at |expr-string| is "\<C-W>" which is the "<control>"
          > character defined by ASCII as 0x17 ("\x17") and by Unicode as U+0017
          > ("\u0017"), not a "special" non-ASCII key like <F8>, <Home> or
          > <PageDown>. I had always thought that _every_ <> name could be used in a
          > double-quoted string with a backslash prefix, and indeed I have verified
          > that it works for all the <Char-nnnn> or <Char-0xnnnn> that I tested
          > _except_ those whose UTF-8 expansion includes either or both of the
          > bytes 0x80 and 0x9B, in which case two spurious bytes are inserted
          > immediately after every occurrence of a 0x80 or 0x9B byte.
          >
          > If this bug is WONTFIX, I suggest mentioning explicitly at the bottom of
          > the list under |expr-quote| that the \<xxx> form does not apply if xxx
          > is Char-nnnn or Char-0xnnnn.

          Yes.

          --
          SOLDIER: Where did you get the coconuts?
          ARTHUR: Through ... We found them.
          SOLDIER: Found them? In Mercea. The coconut's tropical!
          "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

        • Tony Mechelynck
          Message 4 of 8 , Apr 11, 2010
            On 11/04/10 16:33, Bram Moolenaar wrote:
            [...]
            > It's weird that digraphs are defined for an area that doesn't have
            > characters assigned to it. I wonder what happened here. Perhaps this
            > changed at some point in time? If we know the reason we may want to
            > drop all the digraphs for 0xexxx.
            [...]

            My guess is that when that RFC was drafted in 1992, some of the charsets
            they wanted to list used a few characters which, at that time, weren't
            clearly assigned to one Unicode codepoint, and that the RFC authors
            arbitrarily (and maybe temporarily) placed these characters in a
            "private use area", which is the only place where "characters not yet
            assigned a Unicode codepoint" may go. This is only a guess, however. I'm
            not sure how many people are reading this (extremely low-volume) ML, but
            maybe someone knows the history of those mnemonics from RFC 1345 better
            than you and I do? If someone with that knowledge is reading this,
            please speak up.

            IMHO it makes no sense to have digraphs in Vim for "private use"
            characters. I propose to drop any of them that cannot be usefully
            reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
            means forty-one codepoints, it ought not to be a big problem.
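The count and the "private use" status of that range can be checked with Python's standard `unicodedata` module. This is a sketch that relies on the Unicode data shipped with the interpreter; the variable names are chosen for the example only.

```python
import unicodedata

first, last = 0xE000, 0xE028

# Inclusive range, so add one to the difference.
count = last - first + 1

# General category "Co" marks a private-use code point.
all_private = all(
    unicodedata.category(chr(cp)) == "Co" for cp in range(first, last + 1)
)

print(count, all_private)
```

All forty-one code points fall in the Private Use Area, so none of them has an assigned character that a digraph could meaningfully produce.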


            Best regards,
            Tony.
            --
            LAUNCELOT: At last! A call! A cry of distress ...
            (he draws his sword, and turns to CONCORDE)
            Concorde! Brave, Concorde ... you shall not have died in vain!
            CONCORDE: I'm not quite dead, sir ...
            "Monty Python and the Holy Grail" PYTHON (MONTY)
            PICTURES LTD

          • Bram Moolenaar
            Message 5 of 8 , Apr 11, 2010
              Tony Mechelynck wrote:

              > On 11/04/10 16:33, Bram Moolenaar wrote:
              > [...]
              > > It's weird that digraphs are defined for an area that doesn't have
              > > characters assigned to it. I wonder what happened here. Perhaps this
              > > changed at some point in time? If we know the reason we may want to
              > > drop all the digraphs for 0xexxx.
              > [...]
              >
              > My guess is that when that RFC was drafted in 1992, some of the charsets
              > they wanted to list used a few characters which, at that time, weren't
              > clearly assigned to one Unicode codepoint, and that the RFC authors
              > arbitrarily (and maybe temporarily) placed these characters in a
              > "private use area", which is the only place where "characters not yet
              > assigned a Unicode codepoint" may go. This is only a guess, however. I'm
              > not sure how many people are reading this (extremely low-volume) ML, but
              > maybe someone knows the history of those mnemonics from RFC 1345 better
              > than you and I do? If someone with that knowledge is reading this,
              > please speak up.
              >
              > IMHO it makes no sense to have digraphs in Vim for "private use"
              > characters. I propose to drop any of them that cannot be usefully
              > reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
              > means forty-one codepoints, it ought not to be a big problem.

              Searching revealed a few proposals for these character ranges. And
              this page has a confusing summary:
              http://en.wikibooks.org/wiki/Unicode/Character_reference/E000-EFFF
              "private use" but it does have a table with characters.

              Let's remove these digraphs. I can't imagine anyone is using them.

              --
              Clothes make the man. Naked people have little or no influence on society.
              -- Mark Twain (Samuel Clemens) (1835-1910)

            • Tony Mechelynck
              Message 6 of 8 , Apr 12, 2010
                On 11/04/10 17:33, Bram Moolenaar wrote:
                >
                > Tony Mechelynck wrote:
                [...]
                >> IMHO it makes no sense to have digraphs in Vim for "private use"
                >> characters. I propose to drop any of them that cannot be usefully
                >> reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
                >> means forty-one codepoints, it ought not to be a big problem.
                >
                > Searching revealed a few proposals for these character ranges. And
                > this page has a confusing summary:
                > http://en.wikibooks.org/wiki/Unicode/Character_reference/E000-EFFF
                > "private use" but it does have a table with characters.

                Yes; in my browser and with my usual font most (but not all) of them are
                CJK fullwidth ideograms and full-width counterparts of halfwidth math
                symbols etc. A few are (halfwidth) Latin accented letters which even
                exist in Latin1 i.e. below U+0100 !!! For instance (in my browser)
                U+E023 to U+E081 look like duplicates of ASCII 0x21 to 0x7E in the same
                order. Note however the last sentence immediately before the table:

                «The repertoire seen with your computer's font will most likely not be
                the same as with other computers or fonts.»

                And indeed I see a different glyph for those codepoints in gvim with my
                usual 'guifont', which is not the same as my browser's usual serif and
                sans-serif fonts.

                >
                > Let's remove these digraphs. I can't imagine anyone is using them.
                >

                Neither can I.


                Best regards,
                Tony.
                --
                LAUNCELOT leaps into SHOT with a mighty cry and runs the GUARD
                through and
                hacks him to the floor. Blood. Swashbuckling music (perhaps).
                LAUNCELOT races through into the castle screaming.
                SECOND SENTRY: Hey!
                "Monty Python and the Holy Grail" PYTHON (MONTY)
                PICTURES LTD
