Re: Combining characters U+035x are not supported?

  • Iông Chun
    Message 1 of 11 , Jan 2, 2010
      Hi Tony and list,

      I understand now.
      Characters in the ranges U+0350 to U+0357 and U+035D to U+035F
      were added in Unicode 4.0.
      Characters from U+0358 (the one I used) to U+035C were added even
      later, in Unicode 4.1.
      You can check this in http://www.unicode.org/Public/UNIDATA/DerivedAge.txt

      I use Vim version 7.2.323; its combining character table seems to contain
      most of Unicode 4.0,
      but not U+0350 to U+0357 or U+035D to U+035F.

      I will make a patch to add those additional combining characters,
      according to Unicode 5.2.

      Iông Chun

      On Jan 2, 1:25 PM, Tony Mechelynck <antoine.mechely...@...>
      wrote:
      > On 02/01/10 04:47, Iông Chun wrote:
      >
      >
      >
      > > Happy new year, everyone!
      >
      > > I use Vim to edit an input method table these days, and find that
      > > it doesn't work well with one combining character, Unicode U+0358.
      > > Using g8 on "capital O with dot above right" shows "4f" only,
      > > and the dot is another character ("cd 98").
      >
      > > I looked in the source to see how Vim checks whether a character is a
      > > combining character, and found the following code:
      >
      > > <code>
      > > /*
      > >   * Return TRUE if "c" is a composing UTF-8 character.  This means it will be
      > >   * drawn on top of the preceding character.
      > >   * Based on code from Markus Kuhn.
      > >   */
      > >      int
      > > utf_iscomposing(c)
      > >      int         c;
      > > {
      > >      /* sorted list of non-overlapping intervals */
      > >      static struct interval combining[] =
      > >      {
      > >          {0x0300, 0x034f}, {0x0360, 0x036f}, {0x0483, 0x0486}, {0x0488, 0x0489},
      > > </code>
      >
      > > Is there a reason why characters from 0x0350 to 0x035f are skipped?
      > > It looks to me like all characters in the Unicode block
      > > 'Combining Diacritical Marks', which ranges from 0x0300 to 0x036f,
      > > should be combining characters.
      >
      > > A small patch like this works well for me;
      > > using g8 on "capital O with dot above right" shows "4f + cd 98" now:
      >
      > > <patch>
      > > --- work/vim72/src/mbyte.c~     2010-01-02 10:18:01.000000000 +0800
      > > +++ work/vim72/src/mbyte.c      2010-01-02 11:19:24.000000000 +0800
      > > @@ -1976,7 +1976,7 @@
      > >       /* sorted list of non-overlapping intervals */
      > >       static struct interval combining[] =
      > >       {
      > > -       {0x0300, 0x034f}, {0x0360, 0x036f}, {0x0483, 0x0486}, {0x0488, 0x0489},
      > > +       {0x0300, 0x036f}, {0x0483, 0x0486}, {0x0488, 0x0489},
      > >          {0x0591, 0x05a1}, {0x05a3, 0x05b9}, {0x05bb, 0x05bd}, {0x05bf, 0x05bf},
      > >          {0x05c1, 0x05c2}, {0x05c4, 0x05c4}, {0x0610, 0x0615}, {0x064b, 0x0658},
      > >          {0x0670, 0x0670}, {0x06d6, 0x06dc}, {0x06de, 0x06e4}, {0x06e7, 0x06e8},
      > > </patch>
      >
      > > Regards,
      > > Iông Chun
      >
      > According to the current version (5.2) of the Unicode Standard, all
      > codepoints U+0300 to U+036F are indeed combining characters AFAICT, see
      > http://www.unicode.org/charts/PDF/U0300.pdf
      >
      > However, those in the range U+0350 to U+035F are particularly
      > "esoteric", and I believe that it is possible that they were added in a
      > relatively recent version of the Standard; previous versions (including,
      > maybe, the one which was current when that module was written) would
      > then have these codepoints "undefined".
      >
      > BTW, I notice in a comment at lines 29-30 of that same module:
      >
      > >  *             To make things complicated, up to two composing characters
      > >  *             are allowed.  These are drawn on top of the first char.
      >
      > This is now only true with the default settings. The 'maxcombine' option
      > was added (relatively recently) to allow displaying (if the user sets a
      > non-default value) up to 6 combining characters on top of each spacing
      > character; even more than that can be "edited but not displayed".
      > Shouldn't that comment be updated?
      >
      > Best regards,
      > Tony.
      > --
      > Children are unpredictable.  You never know what inconsistency they're
      > going to catch you in next.
      >                 -- Franklin P. Jones

      --
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php
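
The lookup discussed in this message is a membership test against a sorted list of non-overlapping intervals. Below is a minimal sketch of that idea, not Vim's actual code: the names `cp_interval`, `in_intervals` and `is_composing_sample` are hypothetical, the table is only the first three entries of the one-line patch quoted above, and the helper takes an element count (Vim's own `intable()` is called with `sizeof(combining)` instead).

```c
#include <stddef.h>

/* Hypothetical stand-in for the interval struct used by the table. */
struct cp_interval {
    int first;
    int last;
};

/* Excerpt only: the first entries of the table after the one-line patch. */
static const struct cp_interval combining_sample[] = {
    {0x0300, 0x036f}, {0x0483, 0x0486}, {0x0488, 0x0489},
};

/* Binary search over a sorted, non-overlapping interval table.
 * Returns 1 if c lies inside one of the intervals, otherwise 0. */
static int in_intervals(const struct cp_interval *table, size_t count, int c)
{
    size_t lo = 0;
    size_t hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c < table[mid].first)
            hi = mid;           /* c is below this interval */
        else if (c > table[mid].last)
            lo = mid + 1;       /* c is above this interval */
        else
            return 1;           /* first <= c <= last */
    }
    return 0;
}

/* Convenience wrapper around the sample table. */
static int is_composing_sample(int c)
{
    return in_intervals(combining_sample,
                        sizeof(combining_sample) / sizeof(combining_sample[0]),
                        c);
}
```

With the unpatched first entries (`{0x0300, 0x034f}, {0x0360, 0x036f}`) the same search would report U+0358 as not composing, which matches the g8 behaviour described at the start of the thread.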
    • Iông Chun
      Message 2 of 11 , Jan 2, 2010
        Hi all,

        This is a patch to update the combining character table to Unicode 5.2.
        I added these entries by eye and by hand, so there may be errors ;)

        <patch>
        --- work/vim72/src/mbyte.c.bak 2010-01-02 13:33:40.000000000 +0800
        +++ work/vim72/src/mbyte.c 2010-01-02 18:33:49.000000000 +0800
        @@ -1976,35 +1976,64 @@
             /* sorted list of non-overlapping intervals */
             static struct interval combining[] =
             {
        -       {0x0300, 0x034f}, {0x0360, 0x036f}, {0x0483, 0x0486}, {0x0488, 0x0489},
        -       {0x0591, 0x05a1}, {0x05a3, 0x05b9}, {0x05bb, 0x05bd}, {0x05bf, 0x05bf},
        -       {0x05c1, 0x05c2}, {0x05c4, 0x05c4}, {0x0610, 0x0615}, {0x064b, 0x0658},
        -       {0x0670, 0x0670}, {0x06d6, 0x06dc}, {0x06de, 0x06e4}, {0x06e7, 0x06e8},
        -       {0x06ea, 0x06ed}, {0x0711, 0x0711}, {0x0730, 0x074a}, {0x07a6, 0x07b0},
        -       {0x0901, 0x0903}, {0x093c, 0x093c}, {0x093e, 0x094d}, {0x0951, 0x0954},
        +       {0x0300, 0x036f},
        +       {0x0483, 0x0487}, {0x0488, 0x0489},
        +       {0x0591, 0x05bd}, {0x05bf, 0x05bf}, {0x05c1, 0x05c2}, {0x05c4, 0x05c5},
        +       {0x05c7, 0x05c7},
        +       {0x0610, 0x061a}, {0x064b, 0x065e}, {0x0670, 0x0670}, {0x06d6, 0x06dc},
        +       {0x06de, 0x06e4}, {0x06e7, 0x06e8}, {0x06ea, 0x06ed},
        +       {0x0711, 0x0711}, {0x0730, 0x074a}, {0x07a6, 0x07b0}, {0x07eb, 0x07f3},
        +       {0x0816, 0x0819}, {0x081b, 0x0823}, {0x0825, 0x0827}, {0x0829, 0x082d},
        +       {0x0900, 0x0903}, {0x093c, 0x093c}, {0x093e, 0x094e}, {0x0951, 0x0955},
                {0x0962, 0x0963}, {0x0981, 0x0983}, {0x09bc, 0x09bc}, {0x09be, 0x09c4},
                {0x09c7, 0x09c8}, {0x09cb, 0x09cd}, {0x09d7, 0x09d7}, {0x09e2, 0x09e3},
                {0x0a01, 0x0a03}, {0x0a3c, 0x0a3c}, {0x0a3e, 0x0a42}, {0x0a47, 0x0a48},
        -       {0x0a4b, 0x0a4d}, {0x0a70, 0x0a71}, {0x0a81, 0x0a83}, {0x0abc, 0x0abc},
        -       {0x0abe, 0x0ac5}, {0x0ac7, 0x0ac9}, {0x0acb, 0x0acd}, {0x0ae2, 0x0ae3},
        -       {0x0b01, 0x0b03}, {0x0b3c, 0x0b3c}, {0x0b3e, 0x0b43}, {0x0b47, 0x0b48},
        -       {0x0b4b, 0x0b4d}, {0x0b56, 0x0b57}, {0x0b82, 0x0b82}, {0x0bbe, 0x0bc2},
        -       {0x0bc6, 0x0bc8}, {0x0bca, 0x0bcd}, {0x0bd7, 0x0bd7}, {0x0c01, 0x0c03},
        -       {0x0c3e, 0x0c44}, {0x0c46, 0x0c48}, {0x0c4a, 0x0c4d}, {0x0c55, 0x0c56},
        -       {0x0c82, 0x0c83}, {0x0cbc, 0x0cbc}, {0x0cbe, 0x0cc4}, {0x0cc6, 0x0cc8},
        -       {0x0cca, 0x0ccd}, {0x0cd5, 0x0cd6}, {0x0d02, 0x0d03}, {0x0d3e, 0x0d43},
        -       {0x0d46, 0x0d48}, {0x0d4a, 0x0d4d}, {0x0d57, 0x0d57}, {0x0d82, 0x0d83},
        -       {0x0dca, 0x0dca}, {0x0dcf, 0x0dd4}, {0x0dd6, 0x0dd6}, {0x0dd8, 0x0ddf},
        -       {0x0df2, 0x0df3}, {0x0e31, 0x0e31}, {0x0e34, 0x0e3a}, {0x0e47, 0x0e4e},
        -       {0x0eb1, 0x0eb1}, {0x0eb4, 0x0eb9}, {0x0ebb, 0x0ebc}, {0x0ec8, 0x0ecd},
        +       {0x0a4b, 0x0a4d}, {0x0a51, 0x0a51}, {0x0a70, 0x0a71}, {0x0a75, 0x0a75},
        +       {0x0a81, 0x0a83}, {0x0abc, 0x0abc}, {0x0abe, 0x0ac5}, {0x0ac7, 0x0ac9},
        +       {0x0acb, 0x0acd}, {0x0ae2, 0x0ae3},
        +       {0x0b01, 0x0b03}, {0x0b3c, 0x0b3c}, {0x0b3e, 0x0b44}, {0x0b47, 0x0b48},
        +       {0x0b4b, 0x0b4d}, {0x0b56, 0x0b57}, {0x0b62, 0x0b63}, {0x0b82, 0x0b82},
        +       {0x0bbe, 0x0bc2}, {0x0bc6, 0x0bc8}, {0x0bca, 0x0bcd}, {0x0bd7, 0x0bd7},
        +       {0x0c01, 0x0c03}, {0x0c3e, 0x0c44}, {0x0c46, 0x0c48}, {0x0c4a, 0x0c4d},
        +       {0x0c55, 0x0c56}, {0x0c62, 0x0c63}, {0x0c82, 0x0c83}, {0x0cbc, 0x0cbc},
        +       {0x0cbe, 0x0cc4}, {0x0cc6, 0x0cc8}, {0x0cca, 0x0ccd}, {0x0cd5, 0x0cd6},
        +       {0x0ce2, 0x0ce3},
        +       {0x0d02, 0x0d03}, {0x0d3e, 0x0d43}, {0x0d46, 0x0d48}, {0x0d4a, 0x0d4d},
        +       {0x0d57, 0x0d57}, {0x0d82, 0x0d83}, {0x0dca, 0x0dca}, {0x0dcf, 0x0dd4},
        +       {0x0dd6, 0x0dd6}, {0x0dd8, 0x0ddf}, {0x0df2, 0x0df3},
        +       {0x0e31, 0x0e31}, {0x0e34, 0x0e3a}, {0x0e47, 0x0e4e}, {0x0eb1, 0x0eb1},
        +       {0x0eb4, 0x0eb9}, {0x0ebb, 0x0ebc}, {0x0ec8, 0x0ecd},
                {0x0f18, 0x0f19}, {0x0f35, 0x0f35}, {0x0f37, 0x0f37}, {0x0f39, 0x0f39},
                {0x0f3e, 0x0f3f}, {0x0f71, 0x0f84}, {0x0f86, 0x0f87}, {0x0f90, 0x0f97},
        -       {0x0f99, 0x0fbc}, {0x0fc6, 0x0fc6}, {0x102c, 0x1032}, {0x1036, 0x1039},
        -       {0x1056, 0x1059}, {0x1712, 0x1714}, {0x1732, 0x1734}, {0x1752, 0x1753},
        -       {0x1772, 0x1773}, {0x17b6, 0x17d3}, {0x17dd, 0x17dd}, {0x180b, 0x180d},
        -       {0x18a9, 0x18a9}, {0x1920, 0x192b}, {0x1930, 0x193b}, {0x20d0, 0x20ea},
        -       {0x302a, 0x302f}, {0x3099, 0x309a}, {0xfb1e, 0xfb1e}, {0xfe00, 0xfe0f},
        -       {0xfe20, 0xfe23},
        +       {0x0f99, 0x0fbc}, {0x0fc6, 0x0fc6},
        +       {0x102b, 0x103e}, {0x1056, 0x1059}, {0x105e, 0x1060}, {0x1062, 0x1064},
        +       {0x1067, 0x106d}, {0x1071, 0x1074}, {0x1082, 0x108d}, {0x108f, 0x108f},
        +       {0x109a, 0x109d},
        +       {0x135f, 0x135f},
        +       {0x1712, 0x1714}, {0x1732, 0x1734}, {0x1752, 0x1753}, {0x1772, 0x1773},
        +       {0x17b6, 0x17d3}, {0x17dd, 0x17dd},
        +       {0x180b, 0x180d}, {0x18a9, 0x18a9},
        +       {0x1920, 0x192b}, {0x1930, 0x193b}, {0x19b0, 0x19c0}, {0x19c8, 0x19c9},
        +       {0x1a17, 0x1a1b}, {0x1a55, 0x1a5e}, {0x1a60, 0x1a7c}, {0x1a7f, 0x1a7f},
        +       {0x1b00, 0x1b04}, {0x1b34, 0x1b44}, {0x1b6b, 0x1b73}, {0x1b80, 0x1b82},
        +       {0x1ba1, 0x1baa},
        +       {0x1c24, 0x1c37}, {0x1cd0, 0x1cd2}, {0x1cd4, 0x1ce8}, {0x1ced, 0x1ced},
        +       {0x1cf2, 0x1cf2},
        +       {0x1dc0, 0x1de6}, {0x1dfd, 0x1dff},
        +       {0x20d0, 0x20f0},
        +       {0x2cef, 0x2cf1},
        +       {0x2de0, 0x2dff},
        +       {0x302a, 0x302f}, {0x3099, 0x309a},
        +       {0xa66f, 0xa672}, {0xa67c, 0xa67d}, {0xa6f0, 0xa6f1},
        +       {0xa802, 0xa802}, {0xa806, 0xa806}, {0xa80b, 0xa80b}, {0xa823, 0xa827},
        +       {0xa880, 0xa881}, {0xa8b4, 0xa8c4}, {0xa8e0, 0xa8f1},
        +       {0xa926, 0xa92d}, {0xa947, 0xa953}, {0xa980, 0xa983}, {0xa9b3, 0xa9c0},
        +       {0xaa29, 0xaa36}, {0xaa43, 0xaa43}, {0xaa4c, 0xaa4d}, {0xaa7b, 0xaa7b},
        +       {0xaab0, 0xaab0}, {0xaab2, 0xaab4}, {0xaab7, 0xaab8}, {0xaabe, 0xaabf},
        +       {0xaac1, 0xaac1},
        +       {0xabe3, 0xabea}, {0xabec, 0xabed},
        +       {0xfb1e, 0xfb1e},
        +       {0xfe00, 0xfe0f}, {0xfe20, 0xfe26},
             };

             return intable(combining, sizeof(combining), c);
        </patch>


        Regards,
        Iông Chun

      • Tony Mechelynck
        Message 3 of 11 , Jan 2, 2010
          I'm attaching an extract from the current UnicodeData.txt file where
          I've extracted all codepoints with a nonzero Canonical_Combining_Class
          (field 3, counting the first field [codepoint number] as field 0). I'm
          *not* sure that this property coincides with the "combining character"
          property in the Vim sense, but it's the best I've found. You can check
          any discrepancies by means of
          http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (where the first
          two fields are the codepoint number and name).

          This was obtained by applying :redir to the output of

          silent %g/^\%([^;]*;\)\{3}\%(0;\)\@!/p

          meaning: print all lines containing, at the start of a line, three times
          (zero or more non-semicolons plus one semicolon) not followed by (a zero
          then a semicolon).
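
The same field test can be written outside Vim. The following is a rough sketch in C, assuming records in the semicolon-separated UnicodeData.txt layout described above; `nth_field` and `has_nonzero_ccc` are hypothetical helper names. Unlike the regexp, which accepts anything other than a literal `0;` in field 3, this version treats an empty or non-numeric field as zero.

```c
#include <stdlib.h>
#include <string.h>

/* Return a pointer to the start of field n (0-based) of a
 * semicolon-separated record, or NULL if there are fewer fields. */
static const char *nth_field(const char *line, int n)
{
    while (n > 0) {
        line = strchr(line, ';');
        if (line == NULL)
            return NULL;
        line++;                 /* step past the semicolon */
        n--;
    }
    return line;
}

/* Return 1 if field 3 (Canonical_Combining_Class) parses as a
 * nonzero decimal number, otherwise 0. */
static int has_nonzero_ccc(const char *line)
{
    const char *field = nth_field(line, 3);

    /* strtol stops at the next ';', so only field 3 is read. */
    return field != NULL && strtol(field, NULL, 10) != 0;
}
```

For example, a record such as `0358;COMBINING DOT ABOVE RIGHT;Mn;232;NSM;;;;;N;;;;;` passes the test, while `0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;` does not.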


          Best regards,
          Tony.
          --
          "I do not know myself, and God forbid that I should."
          -- Johann Wolfgang von Goethe

        • Iông Chun
          Message 4 of 11 , Jan 2, 2010
            Hi Tony,

            I should have used UnicodeData.txt too, instead of going through every
            added code point and checking the code charts ;)

            About Canonical_Combining_Class: in the Standard, version 5.2, D52,
            item #2, I read:
            <quote>
            All characters with non-zero canonical combining class are combining
            characters, but the reverse is not the case: there are combining characters
            with a zero canonical combining class.
            </quote>

            and item #1:
            <quote>
            Combining characters consist of all characters with the General Category
            values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing
            Mark (Me).
            </quote>

            and D53:
            <quote>
            Nonspacing mark: A combining character with the General Category of
            Nonspacing Mark (Mn) or Enclosing Mark (Me).
            </quote>

            I don't know whether Vim uses different rules for display and for
            semantics when checking for combining characters. If not, I think the
            table could just contain the nonspacing ones for now.

            I attach the list of the Mn and Me characters, excluding code points
            above U+FFFF.
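
Selecting by General Category rather than by combining class means looking at field 2 of each UnicodeData.txt record. A minimal sketch of that filter follows; the helper names `category_is` and `is_nonspacing_mark` are made up for illustration, and `cat` is assumed to be a two-letter category value.

```c
#include <string.h>

/* Return 1 if the General Category (field 2) of a semicolon-separated
 * UnicodeData.txt-style record equals the two-letter value cat. */
static int category_is(const char *line, const char *cat)
{
    const char *p = line;
    int i;

    for (i = 0; i < 2; i++) {   /* skip fields 0 and 1 */
        p = strchr(p, ';');
        if (p == NULL)
            return 0;
        p++;
    }
    /* The field must be exactly two letters followed by ';'. */
    return strncmp(p, cat, 2) == 0 && p[2] == ';';
}

/* Nonspacing mark in the sense of D53: category Mn or Me. */
static int is_nonspacing_mark(const char *line)
{
    return category_is(line, "Mn") || category_is(line, "Me");
}
```

A record such as `0488;COMBINING CYRILLIC HUNDRED THOUSANDS SIGN;Me;0;NSM;;;;;N;;;;;` is selected even though its combining class is zero, while a spacing mark such as `0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;` is left out.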

            Regards,
            Iông Chun

          • Tony Mechelynck
            Message 5 of 11 , Jan 2, 2010
              Why without codepoint values higher than U+FFFF? Nowadays gvim can
              display them (which wasn't the case when I started studying Unicode with
              gvim 6.x).


              Best regards,
              Tony.
              --
              hundred-and-one symptoms of being an internet addict:
              236. You start saving URL's in your digital watch.

            • Iông Chun
              Message 6 of 11 , Jan 2, 2010
                Because:
                <code>
                struct interval
                {
                    unsigned short first;
                    unsigned short last;
                };
                </code>
                ;)

                I guess the type can be "int" instead of "unsigned short" now.
                The patch with all Mn and Me character ranges is attached.
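
To make the limitation concrete, here is a small sketch of what happens when a code point above U+FFFF is stored in an `unsigned short` field. It assumes the common case of a 16-bit `unsigned short`; the struct and function names are hypothetical, and `interval_wide` is only one possible shape for the widened type.

```c
#include <limits.h>

/* Layout quoted in this message: both bounds are unsigned short. */
struct interval_narrow {
    unsigned short first;
    unsigned short last;
};

/* Hypothetical widened layout, following the suggestion to use int. */
struct interval_wide {
    int first;
    int last;
};

/* Return 1 if code point c can be stored in an unsigned short
 * without loss, otherwise 0. */
static int fits_unsigned_short(long c)
{
    return c >= 0 && c <= (long)USHRT_MAX;
}

/* Return the value that actually ends up in an unsigned short field.
 * Out-of-range values are reduced modulo USHRT_MAX + 1. */
static long stored_in_unsigned_short(long c)
{
    unsigned short s = (unsigned short)c;

    return (long)s;
}
```

With a 16-bit `unsigned short`, a value such as 0x1D167 would be stored as 0xD167, so a table entry for it would silently describe a different range. This is why the narrow table cannot simply be extended past U+FFFF.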

                Regards,
                Iông Chun

              • Tony Mechelynck
                Message 7 of 11 , Jan 2, 2010
                  I see. I suspect other size changes may have to be done then, not only
                  where the structure is defined but possibly where it is used. I hope
                  Bram is following this whole thread.

                  Best regards,
                  Tony.
                  --
                  "A Mormon is a man that has the bad taste and the religion to do what a
                  good many other people are restrained from doing by conscientious
                  scruples and the police."
                  -- Mr. Dooley

                • Bram Moolenaar
                  Message 8 of 11 , Jan 4, 2010
                    There is a script to generate these tables from the Unicode table.
                    I think Markus Kuhn had this. But it should be easy to reproduce with
                    Vim script.

                    Changing all these tables from short to int increases memory use.
                    But adding code to handle two tables would not save much.
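
For comparison, the two-table alternative mentioned here could look roughly like the sketch below: a 16-bit table for code points up to U+FFFF and an `int` table for everything above. All names are hypothetical and the tables are tiny illustrative excerpts, not complete data; the point is only that the second table needs its own struct and its own search routine.

```c
#include <stddef.h>

struct interval16 { unsigned short first; unsigned short last; };
struct interval32 { int first; int last; };

/* Illustrative excerpts only, not complete tables. */
static const struct interval16 bmp_sample[] = {
    {0x0300, 0x036f}, {0x20d0, 0x20f0}, {0xfe20, 0xfe26},
};
static const struct interval32 high_sample[] = {
    {0xe0100, 0xe01ef},
};

/* Binary search in the 16-bit table. */
static int in_table16(const struct interval16 *t, size_t n, int c)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c < t[mid].first)
            hi = mid;
        else if (c > t[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Same search again, duplicated for the wider entry type. */
static int in_table32(const struct interval32 *t, size_t n, int c)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c < t[mid].first)
            hi = mid;
        else if (c > t[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Dispatch on the code point's size to pick the table. */
static int is_composing_split(int c)
{
    if (c < 0)
        return 0;
    if (c <= 0xffff)
        return in_table16(bmp_sample,
                          sizeof(bmp_sample) / sizeof(bmp_sample[0]), c);
    return in_table32(high_sample,
                      sizeof(high_sample) / sizeof(high_sample[0]), c);
}
```

The narrow entries are typically half the size of the wide ones, but the duplicated search code and the extra dispatch eat into that saving, which is the trade-off described above.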

                    --
                    hundred-and-one symptoms of being an internet addict:
                    77. The phone company asks you to test drive their new PBX system

                    /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                    /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                    \\\ download, build and distribute -- http://www.A-A-P.org ///
                    \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

                  • Tony Mechelynck
                    Message 9 of 11 , Jan 8, 2010
                      On 04/01/10 20:17, Bram Moolenaar wrote:
                      [...]
                      >
                      > There is a script to generate these tables from the Unicode table.
                      > I think Markus Kuhn had this. But it should be easy to reproduce with
                      > Vim script.
                      >
                      [...]

                      Yes indeed: the UnicodeData.txt file is meant to be machine-readable, and
                      with the power of Vim regexps at our disposal, extracting the needed
                      data should be a breeze.
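
Once the matching code points have been extracted, the remaining step in generating a table is collapsing them into intervals. The thread suggests Vim script for the whole job; the sketch below shows just the merging step in C, with hypothetical names (`cp_range`, `merge_into_ranges`) and the assumption that the input is already sorted in ascending order.

```c
#include <stddef.h>

struct cp_range {
    int first;
    int last;
};

/* Collapse a sorted array of code points into inclusive ranges.
 * Consecutive (or repeated) code points extend the current range;
 * a gap starts a new one. Writes at most max_out ranges to out and
 * returns the number written. */
static size_t merge_into_ranges(const int *cps, size_t n,
                                struct cp_range *out, size_t max_out)
{
    size_t used = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (used > 0 && cps[i] <= out[used - 1].last + 1) {
            /* Adjacent to or inside the current range. */
            if (cps[i] > out[used - 1].last)
                out[used - 1].last = cps[i];
        } else {
            if (used == max_out)
                break;          /* output buffer is full */
            out[used].first = cps[i];
            out[used].last = cps[i];
            used++;
        }
    }
    return used;
}
```

A generator built this way would emit a single range for U+0483 to U+0489, where the hand-made patch earlier in the thread lists `{0x0483, 0x0487}` and `{0x0488, 0x0489}` separately. Both forms give the same lookup result.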


                      Best regards,
                      Tony.
                      --
                      Her locks an ancient lady gave
                      Her loving husband's life to save;
                      And men -- they honored so the dame --
                      Upon some stars bestowed her name.

                      But to our modern married fair,
                      Who'd give their lords to save their hair,
                      No stellar recognition's given.
                      There are not stars enough in heaven.