Re: Multibyte bugs

  • Tony Mechelynck
    Message 1 of 8, Apr 10, 2010
      On 10/04/10 23:43, Bram Moolenaar wrote:
      >
      > Tony Mechelynck wrote:
      >
      >> 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
      >> GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
      >> doesn't give the expected result: instead of U+00A0 ("Alt-space", the
      >> non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
      >> correctly.
      >
      > Why do you expect CTRL-K<space> <space> to produce 0xa0? According to
      > http://www.faqs.org/rfcs/rfc1345.html it's 0xe000.

      Because of the following paragraph at lines 99-100 of digraph.txt:

      ----8<----
      > For CTRL-K, there is one general digraph: CTRL-K <Space> {char} will enter
      > {char} with the highest bit set. You can use this to enter meta-characters.
      ---->8----

      When {char} is 0x20 i.e. <Space>, the above tells me that CTRL-K <Space>
      <Space> gives 0xA0 i.e. the non-breaking space, which is useful to enter
      the "meta-character" Meta-Space if I don't remember the NS digraph. If
      U+E000 is a "private use" character, I don't see why it needs a digraph
      of its own anyway.
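
A minimal Python sketch of the rule as the help text quoted above states it, assuming "{char} with the highest bit set" means OR-ing a 7-bit character code with 0x80. The helper name `set_high_bit` is hypothetical and this is not Vim's actual implementation.

```python
def set_high_bit(ch: str) -> str:
    """Return the character whose code is ord(ch) with bit 7 (0x80) set.

    Hypothetical helper: it only illustrates the arithmetic described in
    digraph.txt, not how Vim looks up digraphs internally.
    """
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("expected a 7-bit ASCII character")
    return chr(code | 0x80)


# <Space> is 0x20, so setting the high bit gives 0xA0 (the non-breaking space).
nbsp = set_high_bit(" ")
print(f"U+{ord(nbsp):04X}")  # prints U+00A0
```

Under that reading, CTRL-K <Space> <Space> would be expected to yield U+00A0 rather than U+E000.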

      On reading that RFC, which states in its beginning paragraph that it has
      no normative value whatsoever, I see (at the very end of section 3)
      quite a number of digraphs and trigraphs assigned to U+E000 to U+E028,
      in what Unicode calls a "private use area": see for instance the very
      start of http://www.unicode.org/charts/pdf/UE000.pdf:

      ----8<----
      Private Use Area
      Range: E000–F8FF
      The Private Use Area does not contain any character assignments,
      consequently no character code charts or namelists are provided for this
      area.
      ---->8----

      At least some of the characters listed there in the RFC have a different
      Unicode codepoint assigned to them, but maybe Unicode assigned them
      after the RFC (dated June 1992) was published. Personally I have strong
      doubts as to the usefulness of any Vim digraph for a "private use"
      character. U+E000 is listed as "indicates unfinished (Mnemonic)". I'm
      not sure what that means, unless maybe that a blank space in a charset
      chart (further down in the same RFC) indicates that the chart is unfinished?

      >
      >> 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?
      >
      > Why would it be a double-width character? In
      > http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
      > "private use".

      Ah, I see. FWIW my usual 'guifont' has a glyph for it, which AFAICT is a
      fullwidth CJK glyph. OTOH the Unihan database does not mention it.

      >
      >> 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints.
      [...]
      >
      > The form "\<xxx>" is for special keys, not characters. For the character
      > itself use \x or \u or \U. See ":help expr-string".
      > The special keys are escaped for use in a mapping.

      The example given at |expr-string| is "\<C-W>" which is the "<control>"
      character defined by ASCII as 0x17 ("\x17") and by Unicode as U+0017
      ("\u0017"), not a "special" non-ASCII key like <F8>, <Home> or
      <PageDown>. I had always thought that _every_ <> name could be used in a
      double-quoted string with a backslash prefix, and indeed I have verified
      that it works for all the <Char-nnnn> or <Char-0xnnnn> that I tested
      _except_ those whose UTF-8 expansion includes either or both of the
      bytes 0x80 and 0x9B, in which case two spurious bytes are inserted
      immediately after every occurrence of a 0x80 or 0x9B byte.
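
To see which codepoints fall into the affected group, here is a small Python sketch that lists where the bytes 0x80 and 0x9B occur in a codepoint's UTF-8 encoding. It only reproduces the condition described above; the function name `suspect_offsets` and the sample codepoints are illustrative assumptions, and the sketch does not model what Vim inserts after those bytes.

```python
from typing import List

# The two byte values reported above as triggering the spurious insertion.
SUSPECT_BYTES = {0x80, 0x9B}


def suspect_offsets(codepoint: int) -> List[int]:
    """Return the offsets of 0x80 / 0x9B bytes in the UTF-8 form of codepoint."""
    encoded = chr(codepoint).encode("utf-8")
    return [i for i, byte in enumerate(encoded) if byte in SUSPECT_BYTES]


# Sample codepoints chosen for illustration only.
for cp in (0x00A0, 0x00DB, 0x0100, 0xE000):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", encoded.hex(), suspect_offsets(cp))
```

For example, U+00A0 encodes as C2 A0 and is unaffected, while U+0100 encodes as C4 80 and U+00DB as C3 9B, so both would be expected to show the problem.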

      If this bug is WONTFIX, I suggest mentioning explicitly at the bottom of
      the list under |expr-quote| that the \<xxx> form does not apply if xxx
      is Char-nnnn or Char-0xnnnn.


      Best regards,
      Tony.
      --
      "To whoever finds this note -
      I have been imprisoned by my father who wishes me to marry
      against my will. Please please please please come and rescue me.
      I am in the tall tower of Swamp Castle."
      SIR LAUNCELOT's eyes light up with holy inspiration.
      "Monty Python and the Holy Grail" PYTHON (MONTY)
      PICTURES LTD

      --
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php
    • Bram Moolenaar
      Message 2 of 8, Apr 11, 2010
        Tony Mechelynck wrote:

        > >> 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
        > >> GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
        > >> doesn't give the expected result: instead of U+00A0 ("Alt-space", the
        > >> non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
        > >> correctly.
        > >
        > > Why do you expect CTRL-K<space> <space> to produce 0xa0? According to
        > > http://www.faqs.org/rfcs/rfc1345.html it's 0xe000.
        >
        > Because of the following paragraph at lines 99-100 of digraph.txt:
        >
        > ----8<----
        > > For CTRL-K, there is one general digraph: CTRL-K <Space> {char} will enter
        > > {char} with the highest bit set. You can use this to enter meta-characters.
        > ---->8----
        >
        > When {char} is 0x20 i.e. <Space>, the above tells me that CTRL-K <Space>
        > <Space> gives 0xA0 i.e. the non-breaking space, which is useful to enter
        > the "meta-character" Meta-Space if I don't remember the NS digraph. If
        > U+E000 is a "private use" character, I don't see why it needs a digraph
        > of its own anyway.

        Ah, OK.

        > On reading that RFC, which states in its beginning paragraph that it has
        > no normative value whatsoever, I see (at the very end of section 3)
        > quite a number of digraphs and trigraphs assigned to U+E000 to U+E028,
        > in what Unicode calls a "private use area": see for instance the very
        > start of http://www.unicode.org/charts/pdf/UE000.pdf:
        >
        > ----8<----
        > Private Use Area
        > Range: E000–F8FF
        > The Private Use Area does not contain any character assignments,
        > consequently no character code charts or namelists are provided for this
        > area.
        > ---->8----
        >
        > At least some of the characters listed there in the RFC have a different
        > Unicode codepoint assigned to them, but maybe Unicode assigned them
        > after the RFC (dated June 1992) was published. Personally I have strong
        > doubts as to the usefulness of any Vim digraph for a "private use"
        > character. U+E000 is listed as "indicates unfinished (Mnemonic)". I'm
        > not sure what that means, unless maybe that a blank space in a charset
        > chart (further down in the same RFC) indicates that the chart is unfinished?

        It's weird that digraphs are defined for an area that doesn't have
        characters assigned to it. I wonder what happened here. Perhaps this
        changed at some point in time? If we know the reason we may want to
        drop all the digraphs for 0xexxx.


        > >> 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?
        > >
        > > Why would it be a double-width character? In
        > > http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
        > > "private use".
        >
        > Ah, I see. FWIW my usual 'guifont' has a glyph for it, which AFAICT is a
        > fullwidth CJK glyph. OTOH the Unihan database does not mention it.
        >
        > >
        > >> 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints.
        > [...]
        > >
        > > The form "\<xxx>" is for special keys, not characters. For the character
        > > itself use \x or \u or \U. See ":help expr-string".
        > > The special keys are escaped for use in a mapping.
        >
        > The example given at |expr-string| is "\<C-W>" which is the "<control>"
        > character defined by ASCII as 0x17 ("\x17") and by Unicode as U+0017
        > ("\u0017"), not a "special" non-ASCII key like <F8>, <Home> or
        > <PageDown>. I had always thought that _every_ <> name could be used in a
        > double-quoted string with a backslash prefix, and indeed I have verified
        > that it works for all the <Char-nnnn> or <Char-0xnnnn> that I tested
        > _except_ those whose UTF-8 expansion includes either or both of the
        > bytes 0x80 and 0x9B, in which case two spurious bytes are inserted
        > immediately after every occurrence of a 0x80 or 0x9B byte.
        >
        > If this bug is WONTFIX, I suggest mentioning explicitly at the bottom of
        > the list under |expr-quote| that the \<xxx> form does not apply if xxx
        > is Char-nnnn or Char-0xnnnn.

        Yes.

        --
        SOLDIER: Where did you get the coconuts?
        ARTHUR: Through ... We found them.
        SOLDIER: Found them? In Mercea. The coconut's tropical!
        "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
        /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
        \\\ download, build and distribute -- http://www.A-A-P.org ///
        \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

      • Tony Mechelynck
        Message 3 of 8, Apr 11, 2010
          On 11/04/10 16:33, Bram Moolenaar wrote:
          [...]
          > It's weird that digraphs are defined for an area that doesn't have
          > characters assigned to it. I wonder what happened here. Perhaps this
          > changed at some point in time? If we know the reason we may want to
          > drop all the digraphs for 0xexxx.
          [...]

          My guess is that when that RFC was drafted in 1992, some of the charsets
          they wanted to list used a few characters which, at that time, weren't
          clearly assigned to one Unicode codepoint, and that the RFC authors
          arbitrarily (and maybe temporarily) placed these characters in a
          "private use area", which is the only place where "characters not yet
          assigned a Unicode codepoint" may go. This is only a guess, however. I'm
          not sure how many people are reading this (extremely low-volume) ML, but
          maybe someone knows the history of those mnemonics from RFC 1345 better
          than you and I do? If someone with that knowledge is reading this,
          please speak up.

          IMHO it makes no sense to have digraphs in Vim for "private use"
          characters. I propose to drop any of them that cannot be usefully
          reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
          means forty-one codepoints, it ought not to be a big problem.
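
The count and the "private use" status of that range can be checked with a short Python sketch. It relies on the standard `unicodedata` module, where general category `Co` marks private-use codepoints; the variable names are illustrative.

```python
import unicodedata

# Range of RFC 1345 mnemonics discussed above (inclusive on both ends).
FIRST, LAST = 0xE000, 0xE028

# Inclusive range, hence the +1.
count = LAST - FIRST + 1

# General category "Co" means "Other, private use".
all_private_use = all(
    unicodedata.category(chr(cp)) == "Co" for cp in range(FIRST, LAST + 1)
)

print(count, all_private_use)  # prints 41 True
```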


          Best regards,
          Tony.
          --
          LAUNCELOT: At last! A call! A cry of distress ...
          (he draws his sword, and turns to CONCORDE)
          Concorde! Brave, Concorde ... you shall not have died in vain!
          CONCORDE: I'm not quite dead, sir ...
          "Monty Python and the Holy Grail" PYTHON (MONTY)
          PICTURES LTD

        • Bram Moolenaar
          Message 4 of 8, Apr 11, 2010
            Tony Mechelynck wrote:

            > On 11/04/10 16:33, Bram Moolenaar wrote:
            > [...]
            > > It's weird that digraphs are defined for an area that doesn't have
            > > characters assigned to it. I wonder what happened here. Perhaps this
            > > changed at some point in time? If we know the reason we may want to
            > > drop all the digraphs for 0xexxx.
            > [...]
            >
            > My guess is that when that RFC was drafted in 1992, some of the charsets
            > they wanted to list used a few characters which, at that time, weren't
            > clearly assigned to one Unicode codepoint, and that the RFC authors
            > arbitrarily (and maybe temporarily) placed these characters in a
            > "private use area", which is the only place where "characters not yet
            > assigned a Unicode codepoint" may go. This is only a guess, however. I'm
            > not sure how many people are reading this (extremely low-volume) ML, but
            > maybe someone knows the history of those mnemonics from RFC 1345 better
            > than you and I do? If someone with that knowledge is reading this,
            > please speak up.
            >
            > IMHO it makes no sense to have digraphs in Vim for "private use"
            > characters. I propose to drop any of them that cannot be usefully
            > reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
            > means forty-one codepoints, it ought not to be a big problem.

            Searching revealed a few proposals for these character ranges. And
            this page has a confusing summary:
            http://en.wikibooks.org/wiki/Unicode/Character_reference/E000-EFFF
            "private use" but it does have a table with characters.

            Let's remove these digraphs. I can't imagine anyone is using them.

            --
            Clothes make the man. Naked people have little or no influence on society.
            -- Mark Twain (Samuel Clemens) (1835-1910)


          • Tony Mechelynck
            Message 5 of 8, Apr 12, 2010
              On 11/04/10 17:33, Bram Moolenaar wrote:
              >
              > Tony Mechelynck wrote:
              [...]
              >> IMHO it makes no sense to have digraphs in Vim for "private use"
              >> characters. I propose to drop any of them that cannot be usefully
              >> reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
              >> means forty-one codepoints, it ought not to be a big problem.
              >
              > Searching revealed a few proposals for these character ranges. And
              > this page has a confusing summary:
              > http://en.wikibooks.org/wiki/Unicode/Character_reference/E000-EFFF
              > "private use" but it does have a table with characters.

              Yes; in my browser and with my usual font most (but not all) of them are
              CJK fullwidth ideograms and full-width counterparts of halfwidth math
              symbols etc. A few are (halfwidth) Latin accented letters which even
              exist in Latin1 i.e. below U+0100 !!! For instance (in my browser)
              U+E023 to U+E081 look like duplicates of ASCII 0x21 to 0x7E in the same
              order. Note however the last sentence immediately before the table:

              «The repertoire seen with your computer's font will most likely not be
              the same as with other computers or fonts.»

              And indeed I see a different glyph for those codepoints in gvim with my
              usual 'guifont', which is not the same as my browser's usual serif and
              sans-serif fonts.

              >
              > Let's remove these digraphs. I can't imagine anyone is using them.
              >

              Neither can I.


              Best regards,
              Tony.
              --
              LAUNCELOT leaps into SHOT with a mighty cry and runs the GUARD through
              and hacks him to the floor. Blood. Swashbuckling music (perhaps).
              LAUNCELOT races through into the castle screaming.
              SECOND SENTRY: Hey!
              "Monty Python and the Holy Grail" PYTHON (MONTY)
              PICTURES LTD
