Re: Match word containing characters beyond a-zA-Z

  • Andy Wokula
    Message 1 of 7 , Jan 7, 2013
      Am 07.01.2013 10:51, schrieb Marco:
      > I'm trying to match words containing characters beyond a-zA-Z. The
      > problem is that words like
      >
      > prästgården
      > treść
      >
      > are not recognized as words. If I match \v(\w+) on these words,
      > prästgården is matched three times and treść is matched only at the
      > beginning:
      >
      > prästgården
      > ^^ ^^^ ^^^^
      > treść
      > ^^^
      >
      > So the problem is that characters like å are not recognized as a
      > character. Checking the words with [:alpha:] proves this, it does
      > not match any of the characters åść. :h regex tells me that
      > [:alpha:] matches *letters*. For me å is a letter, not so for vim.
      >
      > How to convince vim to treat characters like åść as letters? On [1]
      > it was suggested to resort to Perl. But I can hardly believe that
      > it's not possible natively in vim.
      >
      > [1] http://unix.stackexchange.com/questions/60481/match-word-containing-characters-beyond-a-za-z

      The word motion w moves over those characters.
      :h w
      :h word
      :h 'isk
      :h /\k

      Also, [:upper:] and [:lower:] include more characters. Try
      /\c[[:lower:]]\+

      to match lower *and* upper characters.
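
      A minimal sketch of the difference between the two atoms, assuming
      'encoding' is utf-8 and the file is sourced as UTF-8. Vim documents \w
      as [0-9A-Za-z_], while \k follows the 'iskeyword' option, so the result
      of the second call depends on the local setting:

      ```vim
      " \w is fixed to [0-9A-Za-z_]; the match stops before the first
      " non-ASCII letter, so this echoes: pr
      echo matchstr('prästgården', '\w\+')
      " \k matches keyword characters as defined by 'iskeyword';
      " the result depends on the current buffer's setting.
      echo matchstr('prästgården', '\k\+')
      " Show the value of 'iskeyword' that \k is using in this buffer.
      setlocal iskeyword?
      ```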

      --
      Andy

      --
      You received this message from the "vim_use" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    • Marco
      Message 2 of 7 , Jan 7, 2013
        On 2013–01–07 Andy Wokula wrote:

        > The word motion w moves over those characters.
        > :h w
        > :h word

        But what does that mean? The w motion recognises å as a
        letter, the regex \w does not.

        w moves a *word* forward. A word is:

        A word consists of a sequence of letters, digits and underscores,
        or a sequence of other non-blank characters, separated with white
        space (spaces, tabs, <EOL>). This can be changed with the
        'iskeyword' option.

        My iskeyword setting is:

        iskeyword=@,48-57,_,192-255

        Why does w move over the word treść? The letters ś and ć are not
        considered to be a letter, right? iskeyword lists range 192-255. If
        I hit ga on ś and ć, it reports the codes 347 and 263.

        > :h 'isk
        > :h /\k

        \k seems to work. Is it safe to replace all occurrences of \w with
        \k? That seems to be the easiest solution. And I still don't
        understand *why* it works, since 347 seems to be out of range for
        iskeyword.
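
        A possible explanation, offered as a sketch rather than a confirmed
        answer: :help 'isfname', whose format 'iskeyword' shares, notes that
        multi-byte characters 256 and above are always included and that only
        characters up to 255 are specified with the option. That would
        account for ś (347) and ć (263) being treated as keyword characters.
        The check below assumes 'encoding' is utf-8; the loop variable name
        is illustrative only:

        ```vim
        " Code points of the two letters, as reported by ga (347 and 263).
        echo char2nr('ś')
        echo char2nr('ć')
        " Test each character against \k and \w; 1 means match, 0 means none.
        for ch in ['ś', 'ć', '-']
          echo printf('%s: \k=%d \w=%d', ch, ch =~# '^\k$', ch =~# '^\w$')
        endfor
        ```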

        > Also, [:upper:] and [:lower:] include more characters. Try
        > /\c[[:lower:]]\+

        This works for å and ä but fails on ś and ć.


        Marco
      • Christian Brabandt
        Message 3 of 7 , Jan 7, 2013
          On Mon, January 7, 2013 11:12, Andy Wokula wrote:
          > Am 07.01.2013 10:51, schrieb Marco:
          >> I'm trying to match words containing characters beyond a-zA-Z. The
          >> problem is that words like
          >>
          >> prästgården
          >> treść
          >>
          >> are not recognized as words. If I match \v(\w+) on these words,
          >> prästgården is matched three times and treść is matched only at the
          >> beginning:
          >>
          >> prästgården
          >> ^^ ^^^ ^^^^
          >> treść
          >> ^^^
          >>
          >> So the problem is that characters like å are not recognized as a
          >> character. Checking the words with [:alpha:] proves this, it does
          >> not match any of the characters åść. :h regex tells me that
          >> [:alpha:] matches *letters*. For me å is a letter, not so for vim.
          >>
          >> How to convince vim to treat characters like åść as letters? On [1]
          >> it was suggested to resort to Perl. But I can hardly believe that
          >> it's not possible natively in vim.
          >>
          >> [1]
          >> http://unix.stackexchange.com/questions/60481/match-word-containing-characters-beyond-a-za-z
          >
          > The word motion w moves over those characters.
          > :h w
          > :h word
          > :h 'isk
          > :h /\k
          >
          > Also, [:upper:] and [:lower:] include more characters. Try
          > /\c[[:lower:]]\+
          >
          > to match lower *and* upper characters.
          >

          It still doesn't match ść (Is this a bug?).

          However, you could possibly use equivalence
          classes like /[[=s=][=c=][=a=]], which you can read about at :h /[[=

          regards,
          Christian
        • Marco
          Message 4 of 7 , Jan 7, 2013
            On 2013–01–07 Marco wrote:

            > \k seems to work. Is it safe to replace all occurrences of \w with
            > \k? That seems to be the easiest solution.

            This wreaks havoc! dots, dashes, brackets, everything is matched. I
            could reduce iskeyword to include only 0-9a-zA-Z_ and all accented
            characters. That would be some work to figure out what exactly to
            put into iskeyword but it might work.

            However, the w motion has the correct understanding of what a word
            is. Where does w get the information from? Why does w consider ć to
            be a letter, but not -? I fail to see why w (the motion) and \k (the
            regex) behave differently.
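
            One way to investigate, assuming the cause is a buffer-local
            'iskeyword' that differs from the value quoted earlier (filetype
            plugins commonly change it; the thread does not confirm that this
            is what happened here):

            ```vim
            " Show the buffer-local value and the script that last set it.
            verbose setlocal iskeyword?
            " Assumed remedy: restore, for this buffer only, the value quoted
            " earlier in the thread.
            setlocal iskeyword=@,48-57,_,192-255
            ```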

            Marco
          • Marco
            Message 5 of 7 , Jan 7, 2013
              On 2013–01–07 Christian Brabandt wrote:

              > However, you could possibly use equivalence
              > classes like /[[=s=][=c=][=a=]], which you can read about at :h /[[=

              I only managed to match single characters, not *every* character.
              [=a=] matches äåa, etc. What would be the syntax to match *all
              characters including all accented characters and all characters not
              having an accent but still being a character (e.g. ß)*?
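
              A hedged sketch of one approach, building on the idea of
              narrowing 'iskeyword' and searching with \k. The value below is
              an assumption, not a recommendation from the thread: it drops
              digits and the underscore, and the 192-255 range also covers
              the non-letters × (215) and ÷ (247). Which characters above 255
              are accepted is decided by Vim itself, so treat this as an
              approximation of "all letters":

              ```vim
              " Assumption: letters only, so 48-57 (digits) and '_' are left out.
              setlocal iskeyword=@,192-255
              " Try the restricted \k on sample words; results depend on the
              " Vim version and 'encoding', so no output is claimed here.
              echo matchstr('Straße_1', '\k\+')
              echo matchstr('treść', '\k\+')
              ```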

              Marco
            • Marco
              Message 6 of 7 , Jan 7, 2013
                On 2013–01–07 Christian Brabandt wrote:

                > It still doesn't match ść (Is this a bug?).

                grep matches ś, vim does not:

                echo ś | grep '[[:lower:]]'
                ś

                abcśdef /[[:lower:]]
                ^^^ ^^^

                There definitely seems to be something wrong.

                Marco