
Match word containing characters beyond a-zA-Z

  • Marco
    Message 1 of 7, Jan 7, 2013
      I'm trying to match words containing characters beyond a-zA-Z. The
      problem is that words like

      prästgården
      treść

      are not recognized as words. If I match \v(\w+) on these words,
      prästgården is matched three times and treść is matched only at the
      beginning:

      prästgården
      ^^ ^^^ ^^^^
      treść
      ^^^

      So the problem is that characters like å are not recognized as
      letters. Checking the words with [:alpha:] confirms this: it does
      not match any of the characters åść. :h regex tells me that
      [:alpha:] matches *letters*. For me å is a letter, but not for Vim.
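
The splitting shown above is what an ASCII-only word class would produce. As a rough illustration outside Vim, the following Python sketch compares an ASCII-only `\w` with a Unicode-aware one on the same two words. Python and its `re` module are assumptions made for this illustration; they are not used anywhere in the thread, and this says nothing about how Vim implements `\w`.

```python
import re

# The two words from the post, written with escapes so the code points are unambiguous.
words = ["pr\u00e4stg\u00e5rden", "tre\u015b\u0107"]  # prästgården, treść

# With re.ASCII, \w behaves like [A-Za-z0-9_], so accented letters end a match.
ascii_word = re.compile(r"\w+", re.ASCII)
# Without the flag, \w is Unicode-aware for str patterns and accepts these letters.
unicode_word = re.compile(r"\w+")

for word in words:
    print(word, ascii_word.findall(word), unicode_word.findall(word))
```

With the ASCII-only pattern the pieces are pr, stg, rden and tre, which lines up with the carets in the diagram; the Unicode-aware pattern returns each word whole.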

      How can I convince Vim to treat characters like åść as letters? On
      [1] it was suggested to resort to Perl, but I can hardly believe
      that it's not possible natively in Vim.

      VIM - Vi IMproved 7.3 (2010 Aug 15, compiled Dec 27 2012 21:21:18)
      Included patches: 1-762
      Debian GNU/Linux
      LANG=en_GB.UTF-8

      Marco


      [1] http://unix.stackexchange.com/questions/60481/match-word-containing-characters-beyond-a-za-z
    • Andy Wokula
      Message 2 of 7, Jan 7, 2013
        Am 07.01.2013 10:51, schrieb Marco:
        > I'm trying to match words containing characters beyond a-zA-Z. The
        > problem is that words like
        >
        > prästgården
        > treść
        >
        > are not recognized as words. If I match \v(\w+) on these words,
        > prästgården is matched three times and treść is matched only at the
        > beginning:
        >
        > prästgården
        > ^^ ^^^ ^^^^
        > treść
        > ^^^
        >
        > So the problem is that characters like å are not recognized as a
        > character. Checking the words with [:alpha:] proves this, it does
        > not match any of the characters åść. :h regex tells me that
        > [:alpha:] matches *letters*. For me å is a letter, not so for vim.
        >
        > How to convince vim to treat characters like åść as letters? On [1]
        > it was suggested to resort to Perl. But I can hardly believe that
        > it's not possible natively in vim.
        >
        > [1] http://unix.stackexchange.com/questions/60481/match-word-containing-characters-beyond-a-za-z

        The word motion w moves over those characters.
        :h w
        :h word
        :h 'isk
        :h /\k

        Also, [:upper:] and [:lower:] include more characters. Try
        /\c[[:lower:]]\+

        to match lower *and* upper characters.

        --
        Andy

        --
        You received this message from the "vim_use" maillist.
        Do not top-post! Type your reply below the text you are replying to.
        For more information, visit http://www.vim.org/maillist.php
      • Marco
        Message 3 of 7, Jan 7, 2013
          On 2013-01-07 Andy Wokula wrote:

          > The word motion w moves over those characters.
          > :h w
          > :h word

          But what does that mean? The w motion recognises å as a
          letter, the regex \w does not.

          w moves a *word* forward. A word is:

          A word consists of a sequence of letters, digits and underscores,
          or a sequence of other non-blank characters, separated with white
          space (spaces, tabs, <EOL>). This can be changed with the
          'iskeyword' option.

          My iskeyword setting is:

          iskeyword=@,48-57,_,192-255

          Why does w move over the word treść? The letters ś and ć should
          not count as letters here, right? iskeyword lists the range
          192-255, but if I hit ga on ś and ć, it reports the codes 347 and 263.

          > :h 'isk
          > :h /\k

          \k seems to work. Is it safe to replace all occurrences of \w with
          \k? That seems to be the easiest solution. And I still don't
          understand *why* it works, since 347 seems to be out of range for
          iskeyword.
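
The range question can be checked mechanically. Below is a minimal Python sketch that looks only at the numeric entries of the 'iskeyword' value quoted above and tests whether each character's code point falls inside them. The helper names `numeric_ranges` and `in_listed_ranges` are hypothetical, and the sketch deliberately ignores the `@` and `_` entries, so it is not a model of Vim's full keyword logic.

```python
def numeric_ranges(iskeyword):
    """Collect only the numeric entries ("48-57", "192-255") of an 'iskeyword' value."""
    ranges = []
    for part in iskeyword.split(","):
        lo, sep, hi = part.partition("-")
        # Keep "N" and "N-M" entries; skip symbolic ones such as "@" and "_".
        if lo.isdigit() and (not sep or hi.isdigit()):
            ranges.append((int(lo), int(hi) if sep else int(lo)))
    return ranges


def in_listed_ranges(ch, ranges):
    """True if the code point of ch lies inside one of the numeric ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


ranges = numeric_ranges("@,48-57,_,192-255")
for ch in "\u00e4\u00e5\u015b\u0107":  # ä å ś ć
    print(ch, ord(ch), in_listed_ranges(ch, ranges))
```

ä (228) and å (229) fall inside 192-255, while ś (347) and ć (263) do not, which confirms the numbers reported by ga. One possible explanation for why \k and w still accept them: the help for 'isfname', whose format 'iskeyword' shares, notes that multi-byte characters 256 and above are always included and that only characters up to 255 are specified with the option. Whether that is the whole story for this Vim version is an assumption worth checking against :h 'isfname'.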

          > Also, [:upper:] and [:lower:] include more characters. Try
          > /\c[[:lower:]]\+

          This works for å and ä but fails on ś and ć.


          Marco
        • Christian Brabandt
          Message 4 of 7, Jan 7, 2013
            On Mon, January 7, 2013 11:12, Andy Wokula wrote:
            > Am 07.01.2013 10:51, schrieb Marco:
            >> I'm trying to match words containing characters beyond a-zA-Z. The
            >> problem is that words like
            >>
            >> prästgården
            >> treść
            >>
            >> are not recognized as words. If I match \v(\w+) on these words,
            >> prästgården is matched three times and treść is matched only at the
            >> beginning:
            >>
            >> prästgården
            >> ^^ ^^^ ^^^^
            >> treść
            >> ^^^
            >>
            >> So the problem is that characters like å are not recognized as a
            >> character. Checking the words with [:alpha:] proves this, it does
            >> not match any of the characters åść. :h regex tells me that
            >> [:alpha:] matches *letters*. For me å is a letter, not so for vim.
            >>
            >> How to convince vim to treat characters like åść as letters? On [1]
            >> it was suggested to resort to Perl. But I can hardly believe that
            >> it's not possible natively in vim.
            >>
            >> [1]
            >> http://unix.stackexchange.com/questions/60481/match-word-containing-characters-beyond-a-za-z
            >
            > The word motion w moves over those characters.
            > :h w
            > :h word
            > :h 'isk
            > :h /\k
            >
            > Also, [:upper:] and [:lower:] include more characters. Try
            > /\c[[:lower:]]\+
            >
            > to match lower *and* upper characters.
            >

            It still doesn't match ść (Is this a bug?).

            However, you could possibly use equivalence
            classes like /[[=s=][=c=][=a=]], which you can read about at :h /[[=

            regards,
            Christian

          • Marco
            Message 5 of 7, Jan 7, 2013
              On 2013-01-07 Marco wrote:

              > \k seems to work. Is it safe to replace all occurrences of \w with
              > \k? That seems to be the easiest solution.

              This wreaks havoc! Dots, dashes, brackets: everything is matched. I
              could reduce iskeyword to include only 0-9a-zA-Z_ and all accented
              characters. It would take some work to figure out exactly what to
              put into iskeyword, but it might work.

              However, the w motion has the correct understanding of what a word
              is. Where does w get that information from? Why does w consider ć to
              be a letter, but not -? I fail to see why w (the motion) and \k (the
              regex) behave differently.

              Marco
            • Marco
              Message 6 of 7, Jan 7, 2013
                On 2013-01-07 Christian Brabandt wrote:

                > However, you could possibly use equivalence
                > classes like /[[=s=][=c=][=a=]], which you can read about at :h /[[=

                I only managed to match single characters, not *every* character.
                [=a=] matches äåa, etc. What would be the syntax to match *all
                characters including all accented characters and all characters not
                having an accent but still being a character (e.g. ß)*?
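
The difference between "same base letter" and "is a letter at all" can be made concrete with Unicode data. The Python sketch below is an illustration under stated assumptions: it uses canonical decomposition (NFD) as a stand-in for the equivalence-class idea and the Unicode general category as a stand-in for "letter". Neither is claimed to be what Vim uses internally, and `base_letter` is a hypothetical helper.

```python
import unicodedata


def base_letter(ch):
    """Return the first code point of the canonical decomposition (NFD) of ch."""
    return unicodedata.normalize("NFD", ch)[0]


for ch in "a\u00e4\u00e5\u015b\u0107\u00df":  # a ä å ś ć ß
    # Categories starting with "L" are letters in the Unicode sense.
    print(ch, base_letter(ch), unicodedata.category(ch))
```

Here ä and å reduce to a, ś to s, and ć to c, which is the grouping an equivalence class like [=a=] expresses one base letter at a time. ß has no decomposition and stays ß, yet its category is still a letter category, so a test based on the category covers the "every letter, accented or not" case that a list of equivalence classes does not.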

                Marco
              • Marco
                Message 7 of 7, Jan 7, 2013
                  On 2013-01-07 Christian Brabandt wrote:

                  > It still doesn't match ść (Is this a bug?).

                  grep matches ś, vim does not:

                  echo ś | grep '[[:lower:]]'
                  ś

                  abcśdef /[[:lower:]]
                  ^^^ ^^^

                  There definitely seems to be something wrong.
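
The two observations in this thread (the class works for å and ä but stops at ś) are consistent with a lowercase test that only classifies code points up to 255. The Python sketch below models that guess next to a fully Unicode-aware test. `latin1_lower` and `runs` are hypothetical names, and the 255 cut-off is an assumption inferred from the behaviour reported here, not a statement about Vim's or grep's source code.

```python
from itertools import groupby


def latin1_lower(ch):
    # Hypothetical model: only code points up to 255 are ever classified as lowercase.
    return ord(ch) <= 255 and ch.islower()


def runs(text, is_lower):
    """Return the maximal runs of characters for which is_lower(ch) is true."""
    return ["".join(group) for matched, group in groupby(text, key=is_lower) if matched]


sample = "abc\u015bdef"              # abcśdef
swedish = "pr\u00e4stg\u00e5rden"    # prästgården

print(runs(sample, latin1_lower))   # bounded model splits around ś
print(runs(sample, str.islower))    # Unicode-aware test keeps the run whole
print(runs(swedish, latin1_lower))  # ä and å are below 256, so no split
```

Under the bounded model abcśdef splits into abc and def, as in the caret diagram above, while prästgården stays whole; the Unicode-aware test keeps both strings whole, like the grep result.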

                  Marco