Re: [Clip] \W and underscore

  • Axel Berger
    Message 1 of 15, Mar 16, 2013
      John Shotsky wrote:
      > No, [^\w_]

      Yes, it's wrong. \w (small letter) already includes _, so [^\w_] and
      [^\w] are identical. It's \W (big letter) that you need to add the _ to.
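A minimal check of this point, sketched with Python's `re` module rather than NoteTab's PCRE (an assumption: both treat the underscore as part of `\w`). The sample string is made up for illustration.

```python
import re

sample = "foo_bar baz-42!"  # hypothetical test string

# \w already contains "_", so adding it to the negated class changes nothing.
without_underscore = re.findall(r"[^\w]", sample)
with_underscore = re.findall(r"[^\w_]", sample)
assert without_underscore == with_underscore

# \W is the class that excludes "_"; add it explicitly to catch underscores too.
non_word_or_underscore = re.findall(r"[\W_]", sample)
print(non_word_or_underscore)  # ['_', ' ', '-', '!']
```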

      Axel
    • John Shotsky
      Message 2 of 15, Mar 16, 2013
        I would rather rename all underscores in the beginning to avoid this problem than have to convert my whole library to use that
        nomenclature when all I want is for \w to work as it should. For example, I could convert the underscores to [_] (including the
        brackets) and then all would work as expected.

        PCRE and Perl are already different, I would rather see this cleared up than leave it in place because, uh, that's the way we've
        always done it.

        Regards,
        John
        RecipeTools Web Site: http://recipetools.gotdns.com/
        John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

        From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of flo.gehrke
        Sent: Saturday, March 16, 2013 20:18
        To: ntb-clips@yahoogroups.com
        Subject: [Clip] Re: \W and underscore


        --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
        >
        > Yet again, I've learned the hard way that \w, which is supposed
        > to mean letters and numbers, includes the underscore [_],(...)
        > This presents a bit of a problem when one uses \w virtually
        > everywhere expecting it to only pertain to actual letters and
        > numbers.

        What about the POSIX Character Class '[[:alnum:]]'? It matches numbers and letters (including characters with diacritics) but not
        the underscore.
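Python's `re` module has no POSIX classes, so the following sketch does not use `[[:alnum:]]` itself. It swaps in `[^\W_]`, which expresses the same idea (word characters minus the underscore) and should also be usable in PCRE. The sample text is hypothetical.

```python
import re

sample = "café_42 naïve"  # hypothetical text with diacritics and an underscore

# [^\W_] reads "neither a non-word character nor an underscore",
# which leaves letters and digits only.
alnum_runs = re.findall(r"[^\W_]+", sample)
word_runs = re.findall(r"\w+", sample)

print(alnum_runs)  # ['café', '42', 'naïve']
print(word_runs)   # ['café_42', 'naïve']
```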

        > I DO hope the developers of PCRE will address this problem.

        It's a rule that goes back to the history of Perl. So, probably, the PCRE developers won't feel affected by this issue.

        Flo



      • flo.gehrke
        Message 3 of 15, Mar 16, 2013
          --- In ntb-clips@yahoogroups.com, Axel Berger <Axel-Berger@...> wrote:
          >
          > By the way I just reread the help. A class can either be all
          > positive or all negative...

          That's not quite correct. With POSIX Character Classes, you can combine positive and negative definitions to some extent.

          Example: '[0[:^digit:]]' will match a zero and any character that is not a digit.
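The same mix of a positive item and a negated one can be tried outside NoteTab. Since Python's `re` lacks POSIX classes, this sketch substitutes the shorthand `\D` for `[:^digit:]`; the sample string is made up.

```python
import re

sample = "a0b1c2"  # hypothetical test string

# [0\D] combines a positive item (the literal 0) with a negated
# shorthand (\D, "not a digit") inside one class.
matches = re.findall(r"[0\D]", sample)
print(matches)  # ['a', '0', 'b', 'c']
```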

          Flo
        • John Shotsky
          Message 4 of 15, Mar 16, 2013
            That is useful. I will have to document that for myself. I have a clip called 'my notes' in which I keep all these gems. That's
            where I noticed this original problem was already documented. Duh.

            Regards,
            John
          • Don
            Message 5 of 15, Mar 16, 2013
              I think it should be as is. A _ is not a word boundary ... it is used
              to join the words.

              As Flo points out, they gave you a solution and as Axel points out there
              is another easy solution. If they happened to conclude that you were
              right, that would require all manner of recoding ... which is what you
              are disinclined to do here for your libraries apparently and yet the
              entire world would have to do so if your thought carries the day.

              I'd say it matters not what we think, because as Flo says, it has roots
              in Perl.


            • John Shotsky
              Message 6 of 15, Mar 16, 2013
                Yet it works with \b as a word boundary. If it is treated as a word boundary, it is NOT being treated as a letter or number in THAT
                case. That is, a \b detects that a word ends, but \w includes the [_]. I don't care about history - PCRE is already different than
                Perl. It is not selfish to think that \w, which is defined as all letters and numbers, should actually BE all numbers and letters
                AND NOT the underscore. Nowhere else, in all of PCRE (as far as I know) does a non-letter and non-number count as a letter or a
                number. That is just wrong.

                Regards,
                John
              • flo.gehrke
                Message 7 of 15, Mar 16, 2013
                  --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
                  >
                  > This also complicates the use of \b for word boundaries,
                  > because \b DOES treat this character as a word boundary.

                  This is misleading. A single character like the underscore can never represent a word boundary. '\b' is an assertion that matches at a position where a word character is adjacent to a non-word character (or to the start or end of the text). Thus it denotes a zero-length position, not a character.

                  As discussed here, the underscore is defined as a normal word character. So '\bJohn' doesn't match the string 'aaa _John', for example, because 'John' is not preceded by a word boundary in this case.
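Flo's example can be checked with a short sketch in Python's `re` module, assuming its `\b` follows the same word-character rule as PCRE's. The second and third test strings are additions for illustration.

```python
import re

# "_" is a word character, so there is no boundary between "_" and "J".
no_match = re.search(r"\bJohn", "aaa _John")
assert no_match is None

# With a non-word character (a space) directly before "J", the boundary exists.
match = re.search(r"\bJohn", "aaa John")
assert match is not None and match.start() == 4

# \b consumes nothing: a pattern made of \b alone yields only empty matches.
boundaries = [m.span() for m in re.finditer(r"\b", "a_b c")]
print(boundaries)  # [(0, 0), (3, 3), (4, 4), (5, 5)]
```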

                  Flo
                • Axel Berger
                  Message 8 of 15, Mar 16, 2013
                    "flo.gehrke" wrote:
                    > As discussed here, the underscore is defined as a normal word character.

                    You're absolutely right. I had taken John at his word and not tested
                    this.

                    In the text

                    aaabbbccc
                    aaa bbbccc aaabbb ccc aaa bbb ccc
                    aaa_bbbccc aaabbb_ccc aaa_bbb_ccc
                    aaa _bbb ccc aaa bbb_ ccc aaa _bbb_ ccc
                    aaa_ bbb_ccc aaa_bbb _ccc aaa_ bbb _ccc

                    the pattern "\bbbb\b" (b was a bad letter choice in hindsight) matches
                    the last string in the second and in the fifth line, nothing else.
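The test can be repeated outside NoteTab. This is a sketch using Python's `re` module on the five lines above, assuming single spaces between the items; it lists where the pattern matches on each line.

```python
import re

lines = [
    "aaabbbccc",
    "aaa bbbccc aaabbb ccc aaa bbb ccc",
    "aaa_bbbccc aaabbb_ccc aaa_bbb_ccc",
    "aaa _bbb ccc aaa bbb_ ccc aaa _bbb_ ccc",
    "aaa_ bbb_ccc aaa_bbb _ccc aaa_ bbb _ccc",
]

pattern = re.compile(r"\bbbb\b")

# Collect the start offset of every match, line by line.
match_starts = [[m.start() for m in pattern.finditer(line)] for line in lines]
for number, starts in enumerate(match_starts, start=1):
    print(number, starts)
```

Only lines 2 and 5 report a match, and in both cases it is the final, space-delimited "bbb".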

                    Axel
                  • John Shotsky
                    Message 9 of 15, Mar 17, 2013
                      You're right, I was not paying attention. It was selecting the last character, which was the underscore and the boundary was the
                      following character. If you do your test with a space following the underscore, you will see what I mean.
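One reading of this, sketched in Python's `re` module (the pattern and test string are assumptions, not John's actual clip): a run of word characters swallows the trailing underscore, and the boundary is found after it, in front of the space.

```python
import re

# The underscore belongs to the word, so \b matches after it, before the space.
match = re.search(r"\w+\b", "bbb_ ccc")
print(match.group(), match.span())  # bbb_ (0, 4)
```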

                      Regards,
                      John