
\W and underscore

  • John Shotsky
    Message 1 of 15 , Mar 16, 2013
      Yet again, I've learned the hard way that \w, which is supposed to mean letters and numbers, includes the underscore [_], that being
      the character that often replaces a space between letters and which is at the bottom level of character bodies. Some call it 'low
      line', there are other names as well. This presents a bit of a problem when one uses \w virtually everywhere expecting it to only
      pertain to actual letters and numbers. And, its opposite, \W, which you would expect to capture everything EXCEPT letters and
      numbers misses that character. At this point, I think the only workaround is a character class such as [A-Za-z\d], yet that won't
      capture accented characters and those with diacritical marks. That makes the character class even more complicated, meaning a
      negative class such as [^\W_\r\n] to capture only letters and numbers. I guess I knew about this some time ago, but it is one of
      those things that is easily forgotten, as you really expect letters and numbers to be the ONLY things captured with \w.

      This also complicates the use of \b for word boundaries, because \b DOES treat this character as a word boundary. And, if you search
      for \w_, it will find them, as might be expected. Yet, if you search for \w+, you will find numbers, letters AND the underscore are
      captured. I DO hope the developers of PCRE will address this problem. The underscore should belong to the \W group, not the \w
      group.
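
The behaviour described in this message is easy to reproduce. The following is a minimal sketch, not a NoteTab clip: it assumes Python's `re` module as a stand-in for PCRE (both treat the underscore as a word character), and the sample string is made up for illustration.

```python
import re

sample = "foo_bar baz-qux"  # hypothetical sample text

# \w+ treats the underscore as part of the word
words = re.findall(r"\w+", sample)

# \W+ finds the space and the hyphen, but never the underscore
non_words = re.findall(r"\W+", sample)

print(words)      # ['foo_bar', 'baz', 'qux']
print(non_words)  # [' ', '-']
```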

      Regards,
      John Shotsky
      100 SW 195th Avenue, Unit 155
      Beaverton, Oregon, 97006
      RecipeTools Web Site: http://recipetools.gotdns.com/
      RecipeTools Yahoo Group: http://groups.yahoo.com/group/RecipeTools/
      John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/
      Beaverton Weather: http://shotsky.gotdns.com/index.htm



    • Axel Berger
      Message 2 of 15 , Mar 16, 2013
        John Shotsky wrote:
        > The underscore should belong to the \W group, not the \w group.

        I tend to agree. There is an easy workaround though, which I expect
        you're already aware of. If you stumble across this often then just
        write [^\W_] where you mean \w and [\W_] where you mean \W. It's a bit
        more to type but it should solve your problems and a search and replace
        can easily (though not quickly, I'd do it carefully one by one, not as
        "all") mend your existing clips.
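
The two replacement classes can be tried out as follows. This is a sketch under assumptions: it uses Python's `re` module rather than PCRE, the sample strings are invented, and the accented-letter result relies on Python 3 matching Unicode letters with `\w` by default (in PCRE this may depend on the locale or UTF settings).

```python
import re

sample = "foo_bar baz-qux"        # hypothetical sample text
accented = "crème_brûlée 2x"      # hypothetical text with diacritics

# [^\W_] = "word character, but not underscore" (letters and digits only)
alnum_runs = re.findall(r"[^\W_]+", sample)
accented_runs = re.findall(r"[^\W_]+", accented)

# [\W_] = "non-word character, or underscore"
separators = re.findall(r"[\W_]+", sample)

print(alnum_runs)     # ['foo', 'bar', 'baz', 'qux']
print(accented_runs)  # ['crème', 'brûlée', '2x']
print(separators)     # ['_', ' ', '-']
```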

        By the way I just reread the help. A class can either be all positive or
        all negative, so that [\d^0] for "all digits except zero" is illegal (or
        rather resolves to all digits plus circumflex). Shame, but one can live
        with it.
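
A short sketch of what `[\d^0]` actually matches, again assuming Python's `re` module as a stand-in for PCRE and an invented sample string. The two alternatives for "any digit except zero" are not from the thread; they are common idioms added here for illustration.

```python
import re

sample = "0 7 ^ x"  # hypothetical sample text

# Inside a class, ^ only negates when it comes first; here it is a literal
literal_caret = re.findall(r"[\d^0]", sample)

# Two common ways to say "any digit except zero" (illustrative, not from the thread)
range_form = re.findall(r"[1-9]", sample)
lookahead_form = re.findall(r"(?!0)\d", sample)

print(literal_caret)   # ['0', '7', '^']
print(range_form)      # ['7']
print(lookahead_form)  # ['7']
```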

        Axel
      • John Shotsky
        Message 3 of 15 , Mar 16, 2013
          Agreed, but if you care about line ends, as I usually do, you have to include \r\n in the negative class, or you will find yourself
          many paragraphs ahead of where you intend to be.

          Regards,
          John
        • Axel Berger
          Message 4 of 15 , Mar 16, 2013
            John Shotsky wrote:
            > you have to include \r\n in the negative class

            I don't see why. \w and \W are complementary, so \w and [^\W] should be
            exactly the same. So if what you're looking for is "\w except _" and "\W
            plus _" [\W_] and [^\W_] are your two solutions.
            Or are those line ends another problem on top of the _ one? If so it
            will have to be tackled too of course.

            Axel
          • flo.gehrke
            Message 5 of 15 , Mar 16, 2013
              --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
              >
              > Yet again, I've learned the hard way that \w, which is supposed
              > to mean letters and numbers, includes the underscore [_],(...)
              > This presents a bit of a problem when one uses \w virtually
              > everywhere expecting it to only pertain to actual letters and
              > numbers.

              What about the POSIX Character Class '[[:alnum:]]'? It matches numbers and letters (including characters with diacritics) but not the underscore.

              > I DO hope the developers of PCRE will address this problem.

              It's a rule that goes back to the history of Perl. So, probably, the PCRE developers won't feel affected by this issue.

              Flo
            • John Shotsky
              Message 6 of 15 , Mar 16, 2013
                No, [^\w_]+ will collect line ends too. You need to use [^\w_\r\n]+ to prevent it.
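
The line-end effect can be seen in a small sketch. It assumes Python's `re` module in place of PCRE and a made-up two-paragraph string with Windows line ends.

```python
import re

sample = "one.\r\n\r\n-- two"  # hypothetical two-paragraph text

# Without \r\n in the negated class, one match runs across the paragraph break
greedy = re.findall(r"[^\w_]+", sample)

# Excluding \r and \n keeps every match on its own line
per_line = re.findall(r"[^\w_\r\n]+", sample)

print(greedy)    # ['.\r\n\r\n-- ']
print(per_line)  # ['.', '-- ']
```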

                Regards,
                John
              • Axel Berger
                Message 7 of 15 , Mar 16, 2013
                  John Shotsky wrote:
                  > No, [^\w_]

                  Yes, it's wrong. \w (small letter) already includes _, so [^\w_] and
                  [^\w] are identical. It's \W (big letter) that you need to add the _ to.
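
The equivalence can be checked character by character. This is a sketch only: it assumes Python's `re` module rather than PCRE, and the character set tested is an arbitrary choice.

```python
import re

# Printable ASCII plus CR, LF and two accented letters (hypothetical sample set)
chars = [chr(c) for c in range(32, 127)] + ["\r", "\n", "é", "ü"]

with_underscore = [c for c in chars if re.fullmatch(r"[^\w_]", c)]
without_underscore = [c for c in chars if re.fullmatch(r"[^\w]", c)]
upper_w = [c for c in chars if re.fullmatch(r"\W", c)]

# All three classes accept exactly the same characters
print(with_underscore == without_underscore == upper_w)  # True

# None of them matches the underscore
print("_" in with_underscore)  # False
```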

                  Axel
                • John Shotsky
                  Message 8 of 15 , Mar 16, 2013
                    I would rather rename all underscores in the beginning to avoid this problem than have to convert my whole library to use that
                    nomenclature when all I want is for \w to work as it should. For example, I could convert the underscores to [_] (including the
                    brackets) and then all would work as expected.

                    PCRE and Perl are already different, I would rather see this cleared up than leave it in place because, uh, that's the way we've
                    always done it.

                    Regards,
                    John
                  • flo.gehrke
                    Message 9 of 15 , Mar 16, 2013
                      --- In ntb-clips@yahoogroups.com, Axel Berger <Axel-Berger@...> wrote:
                      >
                      > By the way I just reread the help. A class can either be all
                      > positive or all negative...

                      That's not quite correct. With POSIX Character Classes, you can combine positive and negative definitions to some extent.

                      Example: '[0[:^digit:]]' will match zero and any character that is not a digit.
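
The same mix of a positive and a negative item in one class can be sketched in Python. Note the assumption: Python's `re` module has no POSIX bracket classes, so the negated shorthand `\D` stands in for `[:^digit:]` here; the sample string is invented.

```python
import re

sample = "a0b12"  # hypothetical sample text

# \D is the negated shorthand for "not a digit", combined with a literal 0
zero_or_non_digit = re.findall(r"[0\D]", sample)

print(zero_or_non_digit)  # ['a', '0', 'b']
```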

                      Flo
                    • John Shotsky
                      Message 10 of 15 , Mar 16, 2013
                        That is useful. I will have to document that for myself. I have a clip called 'my notes' in which I keep all these gems. That's
                        where I noticed this original problem was already documented. Duh.

                        Regards,
                        John
                      • Don
                        Message 11 of 15 , Mar 16, 2013
                          I think it should be as is. A _ is not a word boundary ... it is used
                          to join the words.

                          As Flo points out, they gave you a solution and as Axel points out there
                          is another easy solution. If they happened to conclude that you were
                          right, that would require all manner of recoding ... which is what you
                          are disinclined to do here for your libraries apparently and yet the
                          entire world would have to do so if your thought carries the day.

                          I'd say it matters not what we think, because as Flo says, it has roots
                          in Perl.


                          On 3/16/2013 11:26 PM, John Shotsky wrote:
                          > I would rather rename all underscores in the beginning to avoid this problem than have to convert my whole library to use that
                          > nomenclature when all I want is for \w to work as it should. For example, I could convert the underscores to [_] (including the
                          > brackets) and then all would work as expected.
                          >
                        • John Shotsky
                          Message 12 of 15 , Mar 16, 2013
                            Yet it works with \b as a word boundary. If it is treated as a word boundary, it is NOT being treated as a letter or number in THAT
                            case. That is, a \b detects that a word ends, but \w includes the [_]. I don't care about history - PCRE is already different than
                            Perl. It is not selfish to think that \w, which is defined as all letters and numbers, should actually BE all numbers and letters
                            AND NOT the underscore. Nowhere else, in all of PCRE (as far as I know) does a non-letter and non-number count as a letter or a
                            number. That is just wrong.

                            Regards,
                            John
                          • flo.gehrke
                            Message 13 of 15 , Mar 16, 2013
                              --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
                              >
                              > This also complicates the use of \b for word boundaries,
                              > because \b DOES treat this character as a word boundary.

                              This is misleading. A single character like the underscore can never represent a word boundary. '\b' is an assertion that matches at a position where a non-word character is preceded or followed by a word character. Thus it signifies a zero-length position, not a single character.

                              As discussed here, the underscore is defined as a normal word character. So '\bJohn' doesn't match the string 'aaa _John', for example, because 'John' is not preceded by a word boundary in this case.
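
The '\bJohn' example can be checked with a few lines of code. A minimal sketch, assuming Python's `re` module as a stand-in for PCRE (both define `\b` in terms of `\w`); the second test string is added for contrast and is not from the thread.

```python
import re

# Underscore and J are both word characters, so no boundary lies between them
no_boundary = re.search(r"\bJohn", "aaa _John")

# A space is a non-word character, so a boundary precedes J here
boundary = re.search(r"\bJohn", "aaa John")

print(no_boundary)        # None
print(boundary.group(0))  # John
print(boundary.start())   # 4
```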

                              Flo
                            • Axel Berger
                              Message 14 of 15 , Mar 16, 2013
                                "flo.gehrke" wrote:
                                > As discussed here, the underscore is defined as a normal word character.

                                You're absolutely right. I had taken John at his word and not tested
                                this.

                                In the text

                                aaabbbccc
                                aaa bbbccc aaabbb ccc aaa bbb ccc
                                aaa_bbbccc aaabbb_ccc aaa_bbb_ccc
                                aaa _bbb ccc aaa bbb_ ccc aaa _bbb_ ccc
                                aaa_ bbb_ccc aaa_bbb _ccc aaa_ bbb _ccc

                                the pattern "\bbbb\b" (b was a bad letter choice in hindsight) matches
                                the last string in the second and in the fifth line, nothing else.
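
This test can be repeated outside NoteTab. The sketch below assumes Python's `re` module in place of PCRE and counts the matches of the same pattern on each of the five lines.

```python
import re

text = """aaabbbccc
aaa bbbccc aaabbb ccc aaa bbb ccc
aaa_bbbccc aaabbb_ccc aaa_bbb_ccc
aaa _bbb ccc aaa bbb_ ccc aaa _bbb_ ccc
aaa_ bbb_ccc aaa_bbb _ccc aaa_ bbb _ccc"""

# "bbb" with a word boundary on both sides
pattern = re.compile(r"\bbbb\b")

# Number of matches on each line, top to bottom
counts = [len(pattern.findall(line)) for line in text.splitlines()]

print(counts)  # [0, 1, 0, 0, 1]
```

Only the "bbb" surrounded by spaces matches; every "bbb" that touches an underscore is skipped, because the underscore counts as a word character.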

                                Axel
                              • John Shotsky
                                Message 15 of 15 , Mar 17, 2013
                                  You're right, I was not paying attention. It was selecting the last character, which was the underscore and the boundary was the
                                  following character. If you do your test with a space following the underscore, you will see what I mean.

                                  Regards,
                                  John