Re: possible to make iskeyword support multibyte characters?

  • Tony Mechelynck
    Message 1 of 17 , Jan 2, 2009
      On 02/01/09 15:32, StarWing wrote:
      > or can anyone let the developers know about this?
      [...]

      AFAIK, most Vim developers read not only the vim_dev group but also the
      vim_use group.

      Best regards,
      Tony.
      --
      The trouble with doing something right the first time is that nobody
      appreciates how difficult it was.

      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_use" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---
    • pansz
      Message 2 of 17 , Jan 3, 2009
        Tony Mechelynck wrote:
        > For the meaning of its settings, ":help 'iskeyword'" refers to ":help
        > 'isfname'" where it is said:
        >
        >> Multi-byte characters 256 and above are always included, only the
        >> characters up to 255 are specified with this option.
        >> For UTF-8 the characters 0xa0 to 0xff are included as well.
        >
        > IOW it is not possible to treat some hanzi as 'iskeyword' characters and
        > others not. I think the above means that even the "ideographic
        > full-width space" U+3000 is treated as a keyword character, OTOH I
        > wouldn't affirm this without an experiment (maybe Vim with +multi_byte
        > knows about the main divisions of the Unicode codepoint range).

        This seems to hint that Vim is not using the standard iswalpha(), iswpunct()
        series of wide-character classification functions in <wctype.h>.

        As far as I know, iswalpha() returns true only for true hanzi
        characters and will not return true for characters such as the
        "ideographic full-width space".

        I guess this is a choice for efficiency if Vim uses UTF-8 internally,
        since UTF-8 must be converted to UCS in order to use the <wctype.h>
        functions.

        If that is the case, making 'iskeyword' support multibyte characters isn't
        hard (I have done similar things for the Lua scripting language), but it
        will sacrifice performance.
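
To make the distinction drawn here concrete, the following is a minimal Python sketch. It is not Vim's code and not the C <wctype.h> API: it uses Python's standard unicodedata module to show the Unicode general category of the characters under discussion, after decoding UTF-8 bytes to a code point. The helper name is_letter is hypothetical, and whether a given C library's iswalpha() agrees depends on the platform and locale.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """Rough analogue of iswalpha(): true for Unicode letter categories (L*)."""
    return unicodedata.category(ch).startswith("L")

# UTF-8 bytes must be decoded to a code point before they can be classified.
dao = b"\xe9\x81\x93".decode("utf-8")   # U+9053, the hanzi 道
space = "\u3000"                        # IDEOGRAPHIC SPACE
comma = "\u3001"                        # IDEOGRAPHIC COMMA

for ch in (dao, space, comma):
    # Print the code point, its general category, and the letter test.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter(ch))
```

In Unicode's own data the hanzi is a letter while the ideographic space and comma are not, which is the property the message relies on.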


      • Tony Mechelynck
        Message 3 of 17 , Jan 3, 2009
          On 04/01/09 04:07, pansz wrote:
          > Tony Mechelynck wrote:
          >> For the meaning of its settings, ":help 'iskeyword'" refers to ":help
          >> 'isfname'" where it is said:
          >>
          >>> Multi-byte characters 256 and above are always included, only the
          >>> characters up to 255 are specified with this option.
          >>> For UTF-8 the characters 0xa0 to 0xff are included as well.
          >> IOW it is not possible to treat some hanzi as 'iskeyword' characters and
          >> others not. I think the above means that even the "ideographic
          >> full-width space" U+3000 is treated as a keyword character, OTOH I
          >> wouldn't affirm this without an experiment (maybe Vim with +multi_byte
          >> knows about the main divisions of the Unicode codepoint range).
          >
          > This seems to hint that Vim is not using the standard iswalpha(), iswpunct()
          > series of wide-character classification functions in <wctype.h>.
          >
          > As far as I know, iswalpha() returns true only for true hanzi
          > characters and will not return true for characters such as the
          > "ideographic full-width space".
          >
          > I guess this is a choice for efficiency if Vim uses UTF-8 internally,
          > since UTF-8 must be converted to UCS in order to use the <wctype.h>
          > functions.
          >
          > If that is the case, making 'iskeyword' support multibyte characters isn't
          > hard (I have done similar things for the Lua scripting language), but it
          > will sacrifice performance.

          If you want to be sure, try some Chinese text with both hanzi and
          wide-punctuation and see where the yiw (yank inner word) or viw (visual
          inner word) stops. Here's a sample for you: 道可道、非常道。名可名、非常
          名。 ;-)

          In my Huge gvim 7.2.077 with +multi_byte, viw includes neither
          ideographic comma nor ideographic full stop; but AFAIK there's no way to
          tell vim that 不 "not", 故 "thus", 之 "'s" etc. are non-keyword
          characters, since for multibyte characters this kind of status is hardcoded.
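
The observed behaviour (word selection stopping at the ideographic comma and full stop) can be imitated with a short Python sketch. This is only an illustration under the assumption that characters are grouped by Unicode general category; it is not Vim's actual algorithm, and the names char_class and split_runs are hypothetical.

```python
import itertools
import unicodedata

def char_class(ch: str) -> str:
    """Coarse class: 'word' for letters/digits/marks, 'space', else 'punct'."""
    cat = unicodedata.category(ch)
    if cat[0] in "LNM":
        return "word"
    if cat == "Zs" or ch.isspace():
        return "space"
    return "punct"

def split_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters of the same class into runs."""
    return [(cls, "".join(chars))
            for cls, chars in itertools.groupby(text, key=char_class)]

sample = "道可道、非常道。名可名、非常名。"
for cls, run in split_runs(sample):
    print(cls, run)
```

Each run of hanzi comes out as one "word" and each wide punctuation mark as its own run. Nothing in this scheme can mark an individual hanzi such as 不 as a non-keyword character, which is the limitation described above.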


          Best regards,
          Tony.
          --
          TV is chewing gum for the eyes.
          -- Frank Lloyd Wright

        • Sean
          Message 4 of 17 , Jan 3, 2009
            > Since I found no satisfactory way to use the IM (which _is_
            > installed on my system), I need at least 6 keystrokes to input any
            > hanzi: for instance, for the simplest of them all, the digit one,
            > 一 yi1 U+4E00, I need (after getting into Insert mode) to press
            > Ctrl-V u 4 e 0 0 -- or else, I can use copy-paste if I can find it
            > ready-made in some document.

            It only takes three keystrokes (yi<C-I>) to type your example, 一,
            using my newly-created IME at
            http://vim.sourceforge.net/scripts/script.php?script_id=2506

            Welcome to vim built-in IME :))

            Sean


            On Jan 2, 5:39 am, Tony Mechelynck <antoine.mechely...@...>
            wrote:
            > On 02/01/09 11:30, anhnmncb wrote:
            >
            > > Ping!
            >
            > If you don't get a reply on this ML, the meaning usually is not that
            > nobody saw the question, but rather that nobody knows the answer. Search
            > the help first, then try to make your question clearer if the help
            > doesn't give you an answer (in this case it does, see below).
            >
            >
            >
            >
            >
            > > On 2008-12-31, anhnmncb wrote:
            > >> On 2008-12-31, anhnmncb wrote:
            > >>> Hi, list,
            >
            > >>> when I type Chinese text in vim, I find it's inconvenient to complete
            > >>> Chinese words with C-p/n, because a Chinese word is not separated by spaces but by
            > >>> some characters like "and", "or" and others (I use English to represent a Chinese
            > >>> character), so a Chinese sentence will look like this:
            >
            > >>> ThisIsAChineseWordInSentence.(This is a Chinese word in sentence.)
            >
            > >>> When I have typed "ThisIsAChineseWordIn", now if I want to type Sen<C-p> then
            > >>> vim can't complete the word "Sentence" for me. So I think if iskeyword supported
            > >>> adding Chinese characters to itself, for example (my client doesn't support
            > >>> Chinese, so I use "and" to represent a Chinese character):
            >
            > >>> set iskeyword+="and"
            > >> I meant set iskeyword-="and".
            > >>> then autocompletion would work without problems with Chinese. I don't know if it
            > >>> is easy to handle?
            > >> Also, it would let me navigate quicker in a long Chinese sentence; now I
            > >> have to use /? or fFtT or some hjkls and then input a Chinese character (sometimes
            > >> inputting a Chinese character needs at least 3 English characters to be typed).
            >
            > For the meaning of its settings, ":help 'iskeyword'" refers to ":help
            > 'isfname'" where it is said:
            >
            > > Multi-byte characters 256 and above are always included, only the
            > > characters up to 255 are specified with this option.
            > > For UTF-8 the characters 0xa0 to 0xff are included as well.
            >
            > IOW it is not possible to treat some hanzi as 'iskeyword' characters and
            > others not. I think the above means that even the "ideographic
            > full-width space" U+3000 is treated as a keyword character, OTOH I
            > wouldn't affirm this without an experiment (maybe Vim with +multi_byte
            > knows about the main divisions of the Unicode codepoint range).
            >
            > Since I found no satisfactory way to use the IM (which _is_ installed on
            > my system), I need at least 6 keystrokes to input any hanzi: for
            > instance, for the simplest of them all, the digit one, 一 yi1 U+4E00, I
            > need (after getting into Insert mode) to press Ctrl-V u 4 e 0 0 -- or
            > else, I can use copy-paste if I can find it ready-made in some document.
            >
            > Best regards,
            > Tony.
            > --
            > Paradise is exactly like where you are right now ... only much, much
            > better.
            > -- Laurie Anderson
          • pansz
            Message 5 of 17 , Jan 3, 2009
              Tony Mechelynck wrote:
              > If you want to be sure, try some Chinese text with both hanzi and
              > wide-punctuation and see where the yiw (yank inner word) or viw (visual
              > inner word) stops. Here's a sample for you: 道可道、非常道。名可名、非常
              > 名。 ;-)

              Interesting, I see the wide punctuation characters are recognized, so
              Vim is using wide characters internally, and omitting some particular
              wide characters from 'iskeyword' shouldn't be hard.

              Then why does 'iskeyword' support only characters from 0-255?

            • Tony Mechelynck
              Message 6 of 17 , Jan 3, 2009
                On 04/01/09 06:30, pansz wrote:
                > Tony Mechelynck wrote:
                >> If you want to be sure, try some Chinese text with both hanzi and
                >> wide-punctuation and see where the yiw (yank inner word) or viw (visual
                >> inner word) stops. Here's a sample for you: 道可道、非常道。名可名、非常
                >> 名。 ;-)
                >
                > Interesting, I see the wide punctuation characters are recognized, so
                > Vim is using wide characters internally, and omitting some particular
                > wide characters from 'iskeyword' shouldn't be hard.
                >
                > Then why does 'iskeyword' support only characters from 0-255?

                I'm not sure. I suppose that option was defined before Unicode became
                well-known, maybe even before it existed, when most charsets were of the
                8-bit kind except for East-Asian scripts, which required "special" MBCS
                versions of the OSes anyway (such as MS-DOS 2.25).

                Once the Unicode standard was published, it included not only mappings
                of codepoints to glyphs but also quite a lot of metadata about these
                codepoints (such as wide vs. narrow vs. ambiguous, LTR vs. RTL vs.
                ambiguous, lower/ upper/ titlecase, punctuation, number systems, etc.).
                However, Vim versions with -multi_byte must still be supported, and they
                don't have access to that wealth of meta-information. Also, IIUC it's in
                the ASCII range that there is most variation between programming
                languages, operating systems, human languages, etc. concerning which
                characters may be used in which circumstances.


                Best regards,
                Tony.
                --
                If there are epigrams, there must be meta-epigrams.

              • pansz
                Message 7 of 17 , Jan 3, 2009
                  Tony Mechelynck wrote:
                  > I'm not sure. I suppose that option was defined before Unicode became
                  > well-known, maybe even before it existed, when most charsets were of the
                  > 8-bit kind except for East-Asian scripts, which required "special" MBCS
                  > versions of the OSes anyway (such as MS-DOS 2.25).
                  >
                  > Once the Unicode standard was published, it included not only mappings
                  > of codepoints to glyphs but also quite a lot of metadata about these
                  > codepoints (such as wide vs. narrow vs. ambiguous, LTR vs. RTL vs.
                  > ambiguous, lower/ upper/ titlecase, punctuation, number systems, etc.).
                  > However, Vim versions with -multi_byte must still be supported, and they
                  > don't have access to that wealth of meta-information. Also, IIUC it's in
                  > the ASCII range that there is most variation between programming
                  > languages, operating systems, human languages, etc. concerning which
                  > characters may be used in which circumstances.

                  The CJK languages are not in the ASCII range at all, and I bet CJK
                  speakers make up more than 30% of the world population. Vim is for
                  programmers, but is it _only_ for programmers?

                  The difficulty may be that 'iskeyword' is a whitelist, not a
                  blacklist: we cannot easily blacklist a single Unicode character in
                  'iskeyword' without knowing *all* the Unicode characters which match
                  iswalpha().

                  Perhaps the simplest approach is to add an option 'isnkeyword' which
                  supports any Unicode character, so that we can blacklist some Unicode
                  characters while still keeping the 'iskeyword' option functioning.
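
A minimal Python sketch of how such a blacklist could combine with a default "letters are keyword characters" rule. Everything here is hypothetical: 'isnkeyword' is only the option proposed in this message, the function names are invented for illustration, and the 0-255 part of 'iskeyword' is not modelled.

```python
import unicodedata

def is_keyword_char(ch: str, isnkeyword: frozenset[str] = frozenset()) -> bool:
    """Hypothetical rule: letters and digits are keyword characters
    unless the character is listed in the 'isnkeyword' blacklist."""
    if ch in isnkeyword:
        return False
    return unicodedata.category(ch)[0] in "LN"

def keyword_before_cursor(text: str, isnkeyword: frozenset[str] = frozenset()) -> str:
    """Return the trailing run of keyword characters, as completion would need."""
    i = len(text)
    while i > 0 and is_keyword_char(text[i - 1], isnkeyword):
        i -= 1
    return text[i:]

blacklist = frozenset("不之故")
print(keyword_before_cursor("道之非常"))             # no blacklist
print(keyword_before_cursor("道之非常", blacklist))  # 之 now breaks the word
```

With the blacklist in place the completion base shrinks to the characters after 之, which is the effect the original poster asked for.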



                • bill lam
                  Message 8 of 17 , Jan 3, 2009
                    On Sun, 04 Jan 2009, pansz wrote:
                    > Interesting, I see the wide punctuation characters are recognized, so
                    > Vim is using wide characters internally, and omitting some particular
                    > wide characters from 'iskeyword' shouldn't be hard.
                    >
                    > Then why does 'iskeyword' support only characters from 0-255?

                    Just a wild guess, since I've never looked into Vim's source code. I
                    think that iskeyword, or spellcheck for that matter, uses an FSM to
                    implement the parser. It's OK to have a table of 256 characters but
                    not so easy to work with a table of millions of Unicode characters.
                    A quick and dirty workaround is to treat all non-8-bit characters as
                    white space.
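
The "table of 256 characters" idea can be sketched in Python as follows. This is an assumption-laden illustration, not Vim's parser: it handles only a subset of the 'iskeyword' syntax ('@', decimal values, decimal ranges, and single literal characters), and it treats '@' as ASCII letters only. The function name build_table is hypothetical.

```python
import string

def build_table(spec: str) -> list[bool]:
    """Build a 256-entry lookup table from a simplified 'iskeyword'-style spec.

    Supported items (a subset, by assumption): '@' for ASCII letters,
    'N' or 'N-M' for decimal byte values, and single literal characters.
    """
    table = [False] * 256
    for item in spec.split(","):
        if item == "@":
            for ch in string.ascii_letters:
                table[ord(ch)] = True
        elif "-" in item and len(item) > 1:
            lo, hi = (int(n) for n in item.split("-"))
            for code in range(lo, hi + 1):
                table[code] = True
        elif item.isdigit():
            table[int(item)] = True
        elif len(item) == 1:
            table[ord(item)] = True
    return table

# Vim's default value of 'iskeyword' on most systems.
table = build_table("@,48-57,_,192-255")
print(table[ord("a")], table[ord("_")], table[ord("-")])
```

A lookup is then a single index into the table, which is cheap for 256 entries but does not scale directly to the full Unicode range.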

                    --
                    regards,
                    ====================================================
                    GPG key 1024D/4434BAB3 2008-08-24
                    gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
                    唐詩202 盧綸 晚次鄂州
                    雲開遠見漢陽城 猶是孤帆一日程 估客晝眠知浪靜 舟人夜語覺潮生
                    三湘愁鬢逢秋色 萬里歸心對月明 舊業已隨征戰盡 更堪江上鼓鼙聲

                  • Tony Mechelynck
                    Message 9 of 17 , Jan 4, 2009
                      On 04/01/09 07:53, pansz wrote:
                      > Tony Mechelynck wrote:
                      >> I'm not sure. I suppose that option was defined before Unicode became
                      >> well-known, maybe even before it existed, when most charsets were of the
                      >> 8-bit kind except for East-Asian scripts, which required "special" MBCS
                      >> versions of the OSes anyway (such as MS-DOS 2.25).
                      >>
                      >> Once the Unicode standard was published, it included not only mappings
                      >> of codepoints to glyphs but also quite a lot of metadata about these
                      >> codepoints (such as wide vs. narrow vs. ambiguous, LTR vs. RTL vs.
                      >> ambiguous, lower/ upper/ titlecase, punctuation, number systems, etc.).
                      >> However, Vim versions with -multi_byte must still be supported, and they
                      >> don't have access to that wealth of meta-information. Also, IIUC it's in
                      >> the ASCII range that there is most variation between programming
                      >> languages, operating systems, human languages, etc. concerning which
                      >> characters may be used in which circumstances.
                      >
                      > The CJK languages are not in the ASCII range at all, and I bet CJK
                      > speakers make up more than 30% of the world population. Vim is for
                      > programmers, but is it _only_ for programmers?

                      No, but each hanzi (not fullwidth punct) is supposed to be a "word" or
                      "word part" of some kind, with punctuation, whitespace and diacritics
                      all totally outside the "word" range. "Not" is a word in English,
                      regardless of whether it's used alone or in "cannot" or
                      "notwithstanding". These two uses sound almost Chinese-like to me... who
                      don't really know more than a handful of Chinese words. I suppose that
                      if English, like Japanese, used Han-script, "notwithstanding" might be
                      written not-against-stay-now with four glyphs? But I'm daydreaming.

                      >
                      > The difficulty may be that 'iskeyword' is a whitelist, not a
                      > blacklist: we cannot easily blacklist a single Unicode character in
                      > 'iskeyword' without knowing *all* the Unicode characters which match
                      > iswalpha().

                      A more important difficulty is that 'iskeyword' applies only to Unicode
                      codepoints U+0000 to U+007F when 'encoding' is UTF-8 (or any Unicode
                      value aliased to UTF-8 for internal memory), and to characters 0x00 to
                      0xFF when it isn't. Otherwise we might perhaps use ":setlocal isk-=不
                      isk-=之" or some such. This would also mean several arrays of 2 gigabits
                      rather than 256 bits to remember the settings (Vim treats the Unicode
                      range as 0 to 7FFFFFFF. Even if it limited itself to the current
                      official maximum of 10FFFD it would still mean a big increase.)
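
The sizes mentioned here can be checked with a little arithmetic. The Python sketch below assumes one flag bit per code point for a single option; the helper name bitmap_bytes is hypothetical, and 0x110000 is used as the size of Unicode's code space (code points 0 to 0x10FFFF).

```python
def bitmap_bytes(code_points: int) -> int:
    """Bytes needed for a one-bit-per-code-point flag array."""
    return (code_points + 7) // 8

eight_bit = bitmap_bytes(0x100)          # characters 0x00..0xFF
vim_range = bitmap_bytes(0x80000000)     # 0..0x7FFFFFFF, as described above
unicode_range = bitmap_bytes(0x110000)   # 0..0x10FFFF, Unicode's code space

print(eight_bit, "bytes")
print(vim_range // 2**20, "MiB")
print(unicode_range // 2**10, "KiB")
```

So 256 bits fit in 32 bytes, 2 gigabits take 256 MiB per array, and even the restricted Unicode range needs 136 KiB per array.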

                      >
                      > Perhaps the simplest approach is to add an option 'isnkeyword' which
                      > supports any Unicode character, so that we can blacklist some Unicode
                      > characters while still keeping the 'iskeyword' option functioning.

                      Hm. Don't know if Bram would accept that, but you can always try to
                      publish (and maintain) an unofficial patch to the C source. Don't know
                      how easy (and foolproof) it would be. For a single option, a has()
                      feature might be useful but it's less needed than for a whole batch of
                      them: we would always be able to test ":if exists('+isnkeyword')".


                      Best regards,
                      Tony.
                      --
                      A truly wise man never plays leapfrog with a unicorn.

                    • Tony Mechelynck
                      Message 10 of 17 , Jan 4, 2009
                        On 04/01/09 08:10, bill lam wrote:
                        > On Sun, 04 Jan 2009, pansz wrote:
                        >> Interesting, I see the wide punctuation characters are recognized, so
                        >> Vim is using wide characters internally, and omitting some particular
                        >> wide characters from 'iskeyword' shouldn't be hard.
                        >>
                        >> Then why does 'iskeyword' support only characters from 0-255?
                        >
                        > Just a wild guess, since I've never looked into Vim's source code. I
                        > think that iskeyword, or spellcheck for that matter, uses an FSM to
                        > implement the parser. It's OK to have a table of 256 characters but
                        > not so easy to work with a table of millions of Unicode characters.
                        > A quick and dirty workaround is to treat all non-8-bit characters as
                        > white space.
                        >

                        Actually Vim uses a different method (a table of ranges, I think) for
                        Unicode codepoints which require two or more UTF-8 bytes, since we've
                        established that the fullwidth comma and fullwidth full stop are (properly)
                        recognized as breaking "word" selection, and that "ordinary" hanzi aren't.
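
A "table of ranges" lookup might look like the Python sketch below. The ranges are illustrative assumptions chosen for this example, not Vim's real table, and the names RANGES, STARTS and classify are hypothetical. The point is only that a sorted list of ranges plus a binary search classifies any code point without a per-character array.

```python
import bisect

# Illustrative ranges only (an assumption, not Vim's real table):
# (first code point, last code point, class), sorted by first code point.
RANGES = [
    (0x3000, 0x3000, "space"),   # IDEOGRAPHIC SPACE
    (0x3001, 0x3003, "punct"),   # ideographic comma, full stop, ditto mark
    (0x4E00, 0x9FFF, "word"),    # CJK Unified Ideographs block
    (0xFF01, 0xFF0F, "punct"),   # fullwidth forms of ASCII '!' through '/'
]
STARTS = [first for first, _, _ in RANGES]

def classify(ch: str, default: str = "word") -> str:
    """Binary-search the range table; unlisted characters get the default."""
    cp = ord(ch)
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0:
        first, last, cls = RANGES[i]
        if first <= cp <= last:
            return cls
    return default

for ch in "道、。\u3000，":
    print(f"U+{ord(ch):04X}", classify(ch))
```

A table like this stays small because whole blocks share one entry, but it also explains why individual hanzi cannot be given a different status without changing the table itself.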

                        Best regards,
                        Tony.
                        --
                        Hippogriff, n.:
                        An animal (now extinct) which was half horse and half griffin.
                        The griffin was itself a compound creature, half lion and half eagle.
                        The hippogriff was actually, therefore, only one quarter eagle, which
                        is two dollars and fifty cents in gold. The study of zoology is full
                        of surprises.
                        -- Ambrose Bierce, "The Devil's Dictionary"
