Problem with {0,4} quantifier

  • flo.gehrke
    Message 1 of 11 , Feb 16, 2011
      I wonder how NT is dealing with a quantifier like {0,4}.

      There's no problem matching the string '2011' with...

      ^!Find "\d{0,4}" RS

      when starting from the beginning of the line, and '2011' being placed at that position.

      But the same clip won't match the string 'year 2011' when starting from the same position. The cursor seems to be stuck at the beginning of the line. It's the same issue with '\d*' or '\d{0,}'. Why is this?

      There's no problem if the pattern matches at any other position. Again, in the string 'year 2011' the substring '2011' will, for example, be matched with...

      ^!Find "^[^\d]+\K\d{0,4}" RS

      It might be interesting to watch what happens when running...

      ^!Info ^$GetDocMatchAll("\d{0,4}")$

      against that string. The infobox displays ';;;;;2011'. That is, at each position in 'year ', the RegEx achieves a "zero hit". (I don't know if that's the correct term; in German, we talk of "Nulltreffer", i.e. a match that doesn't consume any character.) But why should a "zero hit" prevent the RegEx engine from moving forward as shown above?
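
For comparison outside NoteTab, here is a minimal Python sketch of the same "match all" experiment. Python's `re` module is a different engine from the PCRE build inside NoteTab, so this only illustrates how zero-length matches show up in a "find all" result; it is not a statement about NoteTab's internals.

```python
import re

text = "year 2011"

# findall reports every match, including the zero-length ones.
matches = re.findall(r"\d{0,4}", text)
print(matches)

# The first six results correspond to the infobox output quoted above:
# five empty matches (one per character of 'year ') and then '2011'.
print(";".join(matches[:6]))
```

On Python 3.7 or later, `matches` also ends with one more empty string for the position after the final digit, which the NoteTab infobox quoted above does not show.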

      Thanks for any light you can shed on this!

      Flo
    • diodeom
      Message 2 of 11 , Feb 18, 2011
        I believe it's by no design flaw, Flo. :) I'd consider that -- unlike GetDocMatchAll (or Replace All) -- Find operates in singular contexts and keeps no tally of prior matches, its only reference being -- one instance at a time -- the current cursor position (or selection). Given the zero-size earlier match as the starting point of its subsequent execution, Find has nowhere to advance while it still satisfies the pattern by locating no (or up to four) digits.
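
The "nowhere to advance" effect can be sketched in Python. The helper `find_next` is hypothetical and only imitates a single-shot search that resumes from wherever the previous match ended; it assumes nothing about how NoteTab actually implements ^!Find.

```python
import re

def find_next(pattern, text, cursor):
    """Return (start, end) of the first match at or after cursor, or None."""
    m = re.compile(pattern).search(text, cursor)
    return m.span() if m else None

text = "year 2011"
cursor = 0
history = []
for _ in range(3):
    span = find_next(r"\d{0,4}", text, cursor)
    history.append(span)
    cursor = span[1]  # resume from the end of the previous match

# Every attempt returns the same zero-length match at offset 0.
print(history)  # [(0, 0), (0, 0), (0, 0)]

# With the digits at the start of the line the first hit is not empty.
print(find_next(r"\d{0,4}", "2011", 0))  # (0, 4)
```

Because the empty match ends where it started, the cursor never moves, and each retry succeeds again at the same spot.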
      • flo.gehrke
        Message 3 of 11 , Feb 18, 2011
          --- In ntb-clips@yahoogroups.com, "diodeom" <diomir@...> wrote:
          >
          > I believe it's by no design flaw, Flo. :) I'd consider that --
          > unlike GetDocMatchAll (or Replace All) -- Find operates in
          > singular contexts...

          diodeom,

          Thanks for your reply! Nevertheless, it's still a confusing issue for me because, so far, I do not see the logic behind that. Here are some more observations:

          (1) Is it a matter of '^!Find'? Or could it have to do with the specific RegEx flavor and the type of RegEx Engine that's being used with NT?

          When testing '\d{0,4}' against the string 'year 2011', there are different results, for example, with...

          http://regexpal.com --> immediately matching '2011' in that string

          http://rubular.com --> immediately matching the whole string; like $GetDocMatchAll$ it's highlighting five positions where it achieves a "zero hit" and matches '2011'

          RegEx-Coach http://weitz.de/regex-coach/ --> showing exactly the same behavior as NT, i.e. the cursor doesn't get away from the start of line. '2011' is matched only when setting the 'Control' to 'Scan #6 from 5'.

          (2) The point in that issue seems to be the '0' in '\d{0,4}' (or its equivalent \d*). Testing '\d{4}' immediately matches '2011' without any problem.

          Unfortunately, the NT 'Help on Regular Expressions' doesn't say much about quantifiers like that. All I can find are two mentions:

          (2.1) "The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present."

          Hmmm - all right. 'aaabbb' is matched with 'aaax{0}bbb'. But I can't see how this could explain that behavior in question.

          (2.2) "If a pattern starts with .* or .{0,} (...) the pattern is implicitly anchored (...) PCRE normally treats such a pattern as though it were preceded by \A."

          This seems to get close to the point, but it's mentioned in a context with the PCRE_DOTALL option which, for me, seems to be irrelevant here - isn't it?

          Any more ideas?

          Flo
        • diodeom
          Message 4 of 11 , Feb 18, 2011
            I'd be surprised to find thorough consistency between different implementations of even the same version of the very same RegEx library. Besides the build-time settings to mess with, there could be all sorts of hand-holding or -- depending on one's point of view -- hand-hindering trinkets added afterwards in the encompassing software.

            By now, Flo, I'm even less certain whether you're questioning the Nulltreffer itself or only continue to ponder why NoteTab's Find doesn't attempt to predict your apparent intention to disregard a previously found zero-size match on any subsequent execution. My speculation on the latter issue (focused on the quantitative context) still pacifies my (very mild to begin with) curiosity for the time being; the first one (zero quantifier), if it is an issue at all, I'd try (just in case) to philosophically ;) approach as follows:

            If we look for absence of something, we always find it at any given spot -- unless of course this very something (that we don't want to find) actually happens to be there.
          • diodeom
            Message 5 of 11 , Feb 18, 2011
              --- In ntb-clips@yahoogroups.com, some goof wrote:
              >
              > If we look for absence of something, we always find it at any given spot -- unless of course this very something (that we don't want to find) actually happens to be there.
              >


              FWIW: If we add a comma after zero (optionally followed by the max range number), it will make the search greedy (up to the optional max), but if the quantified pattern is not found at the given starting position at non-zero quantity (that is, even if it's not found at all), a match of "no match" is still made in that spot -- because of the silly zero.
            • John Shotsky
              Message 6 of 11 , Feb 18, 2011
                I've been following this thread with some interest, since I use ranges
                extensively, but I cannot understand why one would start such a range with
                zero. It essentially means, 'no numbers' up to 4 numbers, which simply makes
                no sense to me. Why not {1,4}, or is there something that is wanted that I
                don't perceive?



                When I tested this, I found that if the cursor is placed exactly before
                *any* number, it finds the subsequent numbers, up to 4. (Change it to
                200111111 to see this.) So, if placed between 2 and 001, it finds 0011. If
                placed before 2001, it finds 2001. If placed before a space, followed by
                numbers, it doesn't find the numbers at all. I take that to mean that if it
                doesn't find any numbers following the cursor, it won't report a find.



                When using {1,4}, it skips everything until it hits a number, then reports
                exactly as above. That is, they work exactly the same, except that {0,4}
                will not skip past a non number to find a subsequent number. Since these
                actions are different, I can see how it could be used to test for the
                presence of a 'remaining' number from the current cursor position, rather
                than skipping forward, looking for a string of numbers. That is, it could be
                useful, but I don't see a reason why it would be better than {1,4}, unless
                one was explicitly requiring the numbers to occupy the space following the
                cursor. Apparently, it could be used to 'chunk' a number into discrete
                groups.
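
These observations can be reproduced in a rough Python sketch, where the `pos` argument of `search` stands in for the cursor position. The sample string and offsets are made up for illustration, and Python's `re` is not the engine NoteTab uses.

```python
import re

zero_to_four = re.compile(r"\d{0,4}")
one_to_four = re.compile(r"\d{1,4}")

text = "year 200111111"  # the digits start at offset 5

# Cursor directly before a digit: both quantifiers take up to four digits.
print(zero_to_four.search(text, 5).group())  # 2001
print(zero_to_four.search(text, 6).group())  # 0011

# Cursor before the space: {0,4} is satisfied by an empty match right there,
# while {1,4} skips ahead to the first digit.
print(repr(zero_to_four.search(text, 4).group()))  # ''
print(one_to_four.search(text, 4).group())         # 2001

# 'Chunking' a long number into groups of at most four digits.
print(one_to_four.findall("200111111"))  # ['2001', '1111', '1']
```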

                John



              • Axel Berger
                Message 7 of 11 , Feb 19, 2011
                  John Shotsky wrote:
                  > but I cannot understand why one would start such a range with
                  > zero. It essentially means, 'no numbers' up to 4 numbers,
                  > which simply makes no sense to me.

                  Actually it does make sense when part of a longer search pattern. I have
                  the *? quantifier more than once. It means "there may or may not be a
                  number here". But used as the only pattern like in the example discussed
                  here, I agree it wouldn't make a lot of sense. If you look for nothing,
                  then "I found nothing" becomes ambiguous.

                  Axel
                • flo.gehrke
                  Message 8 of 11 , Feb 19, 2011
                    --- In ntb-clips@yahoogroups.com, "diodeom" <diomir@...> wrote:
                    > --- In ntb-clips@yahoogroups.com, some goof wrote:
                    > >
                    > > If we look for absence of something, we always find it at any given spot -- unless of course this very something (that we don't want to find) actually happens to be there.
                    >
                    > FWIW: If we add a comma after zero (optionally followed by the max
                    > range number), it will make the search greedy (up to the optional
                    > max), but if the quantified pattern is not found at the given
                    > starting position at non-zero quantity (that is, even if it's not
                    > found at all), a match of "no match" is still made in that spot --
                    > because of the silly zero.

                    Agreed! That's why...

                    ^!Replace "\d{0}" >> "!" WARS

                    replaces 'year 2011' with '!y!e!a!r! !2!0!1!1!'. That is, it finds 10 positions where '0' is absent. But it doesn't nail the cursor to the start of the line like '\d{0,4}' or '\d*' or '\d{0,}'.
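
The same replacement can be tried in Python as a cross-check. This is only a sketch with a different regex engine (`re`, not NoteTab's PCRE), but for this particular pattern it gives the same result.

```python
import re

text = "year 2011"

# \d{0} can only ever match the empty string, so there is one hit
# at every position: before each of the 9 characters and at the end.
result = re.sub(r"\d{0}", "!", text)
count = len(re.findall(r"\d{0}", text))

print(result)  # !y!e!a!r! !2!0!1!1!
print(count)   # 10
```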

                    Well, I shouldn't tax your patience too much with that. Maybe it's like Pythagoras' theorem: You can successfully work with it without understanding why it's true ;-)

                    Thanks to all who have contributed to this topic!

                    Flo
                  • diodeom
                    Message 9 of 11 , Feb 20, 2011
                      --- In ntb-clips@yahoogroups.com, "flo.gehrke" <flo.gehrke@...> wrote:
                      >
                      > ^!Replace "\d{0}" >> "!" WARS
                      >
                      > replaces 'year 2011' with '!y!e!a!r! !2!0!1!1!'. That is, it finds 10 positions where '0' is absent. But it doesn't nail the cursor to the start of the line like '\d{0,4}' or '\d*' or '\d{0,}'.
                      >


                      Apples to apples: when issued by Find, either of the above patterns ought to report a single match and, if it's null-sized, "nail the cursor" for subsequent tries. But any of these patterns should also properly capture all replacement locations when used in Replace WARS -- because of this command's broad (plural) scope.
                    • Sheri
                      Message 10 of 11 , Feb 21, 2011
                        Hi Flo,

                        I do think you got good answers already. I would only add that PCRE has a couple of options that can be set, PCRE_NOTEMPTY and PCRE_NOTEMPTY_ATSTART. These are "exec", not "compile" options, so there is no means to implement either with some construct in the pattern. NoteTab does not provide a means to set either, but FWIW, Powerpro's regex plugin does.

                        With PCRE_NOTEMPTY set, your pattern would not match at the first position or between each letter. It matches at 2011.

                        With PCRE_NOTEMPTY_ATSTART set, it doesn't match at the y, but does match at the e.
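
Python's `re` module has no equivalent of these two exec options, so the sketch below only imitates their effect for simple patterns such as '\d{0,4}'. The helper names `search_notempty` and `search_notempty_atstart` are hypothetical. Real PCRE would first try alternative non-empty matches at the same position before moving on, which this rough stand-in does not do.

```python
import re

def search_notempty(pattern, text, pos=0):
    """Rough stand-in for PCRE_NOTEMPTY: first non-empty match from pos."""
    for m in re.compile(pattern).finditer(text, pos):
        if m.group() != "":
            return m
    return None

def search_notempty_atstart(pattern, text, pos=0):
    """Rough stand-in for PCRE_NOTEMPTY_ATSTART: reject only an empty
    match located exactly at the starting offset."""
    rx = re.compile(pattern)
    m = rx.search(text, pos)
    if m and m.start() == pos and m.end() == pos:
        if pos >= len(text):
            return None
        m = rx.search(text, pos + 1)  # retry one character further on
    return m

text = "year 2011"
print(search_notempty(r"\d{0,4}", text).span())          # (5, 9), i.e. '2011'
print(search_notempty_atstart(r"\d{0,4}", text).span())  # (1, 1), empty match before 'e'
```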

                        For its ^!Find and dialog Find features, it seems clear that after the first match NoteTab advances to the end of the previous match for retries. The end of the previous match, when that match is an empty string, is, well, no advancement at all!
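
One common way for a repeated search to avoid this standstill is to step one character forward whenever the previous match was empty. The Python sketch below shows that convention with a hypothetical helper `find_all_spans`. It is an assumption about how iterating searches are often written, not a description of what NoteTab does.

```python
import re

def find_all_spans(pattern, text):
    """Collect match spans, stepping one character past any empty match."""
    rx = re.compile(pattern)
    spans = []
    cursor = 0
    while cursor <= len(text):
        m = rx.search(text, cursor)
        if m is None:
            break
        spans.append(m.span())
        if m.end() == m.start():
            cursor = m.end() + 1  # empty match: force the cursor forward
        else:
            cursor = m.end()      # normal case: resume after the match
    return spans

print(find_all_spans(r"\d{0,4}", "year 2011"))
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 9), (9, 9)]
```

With the forced step, the search passes the five empty hits in 'year ' and reaches '2011' at offsets 5 to 9.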

                        Regards,
                        Sheri
                      • flo.gehrke
                        Message 11 of 11 , Feb 21, 2011
                          --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
                          >
                          > Hi Flo,
                          >
                          > I do think you got good answers already. I would only add
                          > that PCRE has a couple of options that can be set (...)
                          > For its ^!Find and dialog Find features, it seems clear that...

                          Thank you, Sheri! Fog is lifting ;-)

                          Also thanks again to diodeom for his patience in paying attention to that issue.

                          Flo