Re: [NTS] NTB or RegEx Bug

  • Eric Fookes
    Message 1 of 11 , Sep 28, 2012
      Hi Art,

      > However, for the string: aaaaaaaa
      >
      > The pattern   matches      Comment
      > ===========   ==========   ================================
      > a*            nothing!!!   greedy, ##### NOT per help file ####
      >
      > Is this a bug in RegEx or in NTB? (tested both NT Std 5.8/fv & 6.2/fv)

      Indeed, it looks like a bug in the regex engine of those NoteTab
      versions. I just tested NoteTab 7 and it worked correctly. So it seems
      the updated regex engine used in the recent NoteTab releases fixed the
      issue.

      --
      Regards,

      Eric Fookes
      http://www.fookes.com/
    • John Shotsky
      Message 2 of 11 , Sep 28, 2012
        It may be a matter of opinion, but when a * is used, the engine first evaluates the very next character following the
        cursor, or the point in the line that is being evaluated. If the condition is met there, it stops. (In this case, the
        zero condition is true, so it stops.) Thus, if the first character is a space or a CR, it will stop. I have run into
        this issue in programming before, and didn't understand then why it wasn't working as I thought it should. Finally, it
        occurred to me that it starts with one character, and if that character is NOT the one looked for, it stops. You will
        notice that if you place the cursor anywhere in the line of a's, it will capture that one and all the rest, but none
        before that point. Once understood, it becomes a feature that you can use to detect that condition.

        Regards,
        John
        RecipeTools Web Site: http://recipetools.gotdns.com/

        From: ntb-scripts@yahoogroups.com [mailto:ntb-scripts@yahoogroups.com] On Behalf Of Art Kocsis
        Sent: Friday, September 28, 2012 07:21
        To: NoteTab-Scripts
        Subject: RE: [NTS] NTB or RegEx Bug


        aaaaaaaaaaa

        I disagree that it is working as it is supposed to. Reread the doc - for
        greedy it says "the maximum number", not zero. Even starting from a
        previous line it should capture all of the a's.

        Or better yet, add some spaces in front of the string of a's and place the
        cursor within them. Again, a* captures zero chars. Zero satisfies the
        condition "zero or more" but it does not satisfy the greedy condition of
        "maximum possible".

        Even stranger, if you place the cursor one space in front of the string of
        a's, a* will capture the entire string (as it should). However, place
        the cursor two or more spaces before the string and it captures zero a's
        (not as it should) but it advances the cursor one position! Not as it should!

        I wonder if this is one of their "optimizations" gone awry.

        Art

        At 9/28/2012 04:42 AM, John wrote:
        >I think it is working as it is supposed to. You probably don't have your
        >cursor in the line at the time you are testing.
        >If it is anywhere else, the 'zero or more' is met by a CR. If you place it
        >anywhere in the line, it will match from the
        >cursor to the end of the line, as expected.
        >
        >
        >From: ntb-scripts@yahoogroups.com [mailto:ntb-scripts@yahoogroups.com] On
        >Behalf Of Art Kocsis
        >Sent: Thursday, September 27, 2012 22:48
        >To: NoteTab-Scripts
        >Subject: [NTS] NTB or RegEx Bug
        >
        >According to the RegEx help file:
        >
        >By default, the quantifiers are "greedy", that is, they match as much
        >as possible (up to the maximum number of permitted times),
        >without causing the rest of the pattern to fail.
        >
        >However, if a quantifier is followed by a question mark, it ceases to
        >be greedy, and instead matches the minimum number of times possible
        >
        >With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
        >repetition, failure of what follows normally causes the repeated
        >item to be re-evaluated to see if a different number of repeats
        >allows the rest of the pattern to match.
        >
        >However, for the string: aaaaaaaa
        >
        >The pattern   matches      Comment
        >===========   ==========   ================================
        > a            a            as expected
        > a+           aaaaaaaa     greedy, as expected per help
        > a*           nothing!!!   greedy, ##### NOT per help file ####
        > a+?          a            lazy, as expected per help
        > a*?          nothing      lazy, as expected per help
        >
        >According to the help file, the pattern "a*" should have matched the
        >maximum permitted - i.e., the entire strings of "a"s. However, it stopped
        >at zero. Why?
        >
        >Is this a bug in RegEx or in NTB? (tested both NT Std 5.8/fv & 6.2/fv)
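For comparison, the five patterns from the table can be tried outside NoteTab. The following is only a sketch using Python's `re` module, which is a backtracking engine similar to, but not the same as, the PCRE build inside NoteTab; the assumption is that the match attempt starts exactly at the first 'a' (the cursor at the start of the string).

```python
import re

subject = "aaaaaaaa"
patterns = ["a", "a+", "a*", "a+?", "a*?"]

# re.match tries the pattern at position 0 only, i.e. "cursor on the first a".
results = {p: re.match(p, subject).group() for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern:4} -> {matched!r}")
```

Started at the first 'a', the greedy `a*` takes the whole run here, while the lazy `a*?` takes nothing. Whether the older NoteTab versions start the attempt at that position is a separate question that this sketch cannot answer.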



      • flo.gehrke
        Message 3 of 11 , Sep 28, 2012
          --- In ntb-scripts@yahoogroups.com, Art Kocsis <artkns@...> wrote:
          >
          > However, for the string: aaaaaaaa
          > a* nothing!!!
          >
          > Is this a bug in RegEx or in NTB? (tested both NT
          > Std 5.8/fv & 6.2/fv)

          No bug -- neither in PCRE nor in NTb.

          'a*' is equivalent to a{0,}. So, at the beginning of the subject string, the engine achieves a match of zero length because in...

          However...aaa

          it finds an 'H' at the beginning, i.e. the absence of 'a'. Since this doesn't consume any character you don't see it.

          Try this: Open the Find dialog, enter 'a*', and close it again. Now press F3 repeatedly and watch the cursor moving forward. It stops at any position where the pattern is true, i.e. where an 'a' is absent or where the engine doesn't see an 'a' when looking to the right. Ntb will match 'aaa' as soon as the engine reaches that string.

          There is a PCRE_NotEmpty match option that changes this behavior. You can test this with the Workbench for DIRegEx (the embedding of PCRE into Ntb). But there is no way to activate that option in Ntb. When that option is chosen, the engine immediately selects 'aaa'.
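The zero-length matches described above can be made visible with a short Python sketch. Python's `re` is not PCRE and has no "not empty" match option, so the last step only approximates that option by skipping empty matches; for a pattern as simple as `a*` the result should be the same, but that equivalence is an assumption.

```python
import re

subject = "However...aaa"

# The first attempt at position 0 sees 'H', so a* succeeds with zero characters.
first_span = re.search(r"a*", subject).span()

# Every match in order; the (n, n) spans are the zero-length stops.
spans = [m.span() for m in re.finditer(r"a*", subject)]

# Rough stand-in for a "not empty" option: ignore zero-length matches.
first_non_empty = next(m for m in re.finditer(r"a*", subject) if m.group())

print(first_span)
print(spans)
print(first_non_empty.span(), first_non_empty.group())
```

Each empty span corresponds to one stop of the cursor when Find Next is pressed repeatedly; the first non-empty span is the 'aaa' run.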

          Regards,
          Flo
        • Art Kocsis
          Message 4 of 11 , Sep 28, 2012
            Yes, a zero length match satisfies a{0,}. So do all of the other
            remaining partial matches. But the whole point is that a* is supposed to
            be greedy. Look at the definition of greedy - it clearly specifies the
            MAXIMUM match, not the minimum. A zero length match is the minimum not the
            maximum possibility. The question mark "lazy" metacharacter specifies the
            minimum match. There should be a difference between greedy and non-greedy.
            Even Eric agreed.

            Also, a non-match should not move the cursor. Try any other search that
            fails and observe the status bar cursor location. It does not move.
            Contrariwise, a
            normal search starts at the current cursor position and searches forward
            (to the end of the file if necessary), looking for a possible match. a*
            doesn't seem to want to get off its duff to even start. [Does that mean
            that a* is lazy? <g>]

            BTW - You don't have to go thru all the hassle of defining, closing and
            reopening the Find window. Clicking on Find Next works quite well. As does
            F3 with the window open.

            BTW2 - Thanks for the heads up on DIRegEx. I will look at it. I assume that
            is the one you use. Is it freeware or shareware? I can't find any
            registration info on the site. What other RegEx apps have you tried?
            [http://www.yunqa.de/delphi/doku.php/products/regex/index#diregex]

            Art

          • flo.gehrke
            Message 5 of 11 , Sep 29, 2012
              I understand that you are testing a subject string like...

              However ... aaa ...

              starting from the beginning of the line. So 'aaa' is not at the start of line but somewhere behind 'However...'.

              > Yes, a zero length match satisfies a{0,}

              OK, so the difference between "no match" and "match of zero length" (or "zero match") can be taken as clear.

              > But the whole point is that a* is supposed to
              > be greedy.

              No doubt, but a "sequence of zero matches" at the same position is not imaginable -- so greediness doesn't matter at those first positions. Greediness first matters when the engine reaches 'aaa'. Only at that position does the pattern match all the 'a's, since it is greedy.
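A small Python sketch of this point: greedy and lazy quantifiers give the same empty result where no 'a' is present, and differ only where the run of 'a' begins. It uses Python's `re` rather than NoteTab's PCRE, so it shows the general behaviour of such engines, not NoteTab itself.

```python
import re

subject = "However...aaa"
run_start = subject.index("aaa")  # position where the a's begin

greedy = re.compile(r"a*")
lazy = re.compile(r"a*?")

# At the 'H' there is nothing to be greedy about: both match zero characters.
greedy_at_h = greedy.match(subject, 0).span()
lazy_at_h = lazy.match(subject, 0).span()

# At the start of the run the two quantifiers finally differ.
greedy_at_run = greedy.match(subject, run_start).span()
lazy_at_run = lazy.match(subject, run_start).span()

print(greedy_at_h, lazy_at_h)
print(greedy_at_run, lazy_at_run)
```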

              > Also, a non-match should not move the cursor.

              Possibly, there are two misunderstandings here: 1. Again, the engine doesn't achieve "non-matches" but matches of zero length. 2. The cursor isn't moved here by "non-matches" but by repeatedly re-starting Find.

              > Contriwise, a normal search starts at the current cursor
              > position and searches forward (to the end of the file if
              > necessary)

              No doubt, and that's exactly what the engine does, even in this case. Compare it with...

              ^!Replace "a*" >> "!" WARS

              tested against 'xxxxxx'. The result is '!x!x!x!x!x!x!' -- i.e., the engine achieves seven zero-length matches, one at every position where 'zero a' is true.
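The same replacement can be reproduced outside NoteTab. This sketch uses Python's `re.subn` as a stand-in for the `^!Replace` clip command, so the options (WARS) have no direct counterpart here; it only shows that a replace-all with `a*` touches every position in a string without any 'a'.

```python
import re

# Replace every match of a* in a string that contains no 'a' at all.
result, count = re.subn(r"a*", "!", "xxxxxx")

print(result)  # one '!' per zero-length match
print(count)   # number of replacements made
```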

              Another question is: Is there a work-around in Ntb for that issue?

              You could use 'a*a' instead of 'a*'. In this case, you are forcing the engine to find zero or more 'a' followed by another 'a'. Thus the zero matches no longer act as a "brake", and the engine will immediately find and select 'aaa'.
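A quick check of this work-around, again sketched with Python's `re` rather than NoteTab's own engine: once the pattern requires at least one 'a', the first match is the full run instead of an empty match at the start of the line.

```python
import re

subject = "However...aaa"

plain_star = re.search(r"a*", subject).span()    # empty match at the 'H'
star_then_a = re.search(r"a*a", subject).span()  # needs at least one 'a'
plus = re.search(r"a+", subject).span()          # one or more 'a'

print(plain_star, star_then_a, plus)
```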

              > Thanks for the heads up on DIRegEx. I will look at it. I assume
              > that is the one you use. Is it freeware or shareware?

              For a short time, it was available for betatesters but now the link doesn't work any more.

              > What other RegEx apps have you tried?

              So far, I don't know any app that fully supports PCRE -- except that workbench and Ntb itself. Sometimes helpful is http://weitz.de/regex-coach/ but it isn't updated to the latest version of PCRE and doesn't support some PCRE features either. I think that's the same problem with RegexBuddy which has often been recommended by Alec Burgess. Please correct me if I'm wrong.

              Regards,
              Flo

            • Axel Berger
              Message 6 of 11 , Sep 29, 2012
                "flo.gehrke" wrote:
                > You could use 'a*a' instead of 'a*'.

                I think in practice a star quantifier on its own is meaningless, there
                must be at least one other thing in the pattern. If you're interested in
                one character only, then the quantifier has to be at least "+". All
                Art's "a*x" examples worked fine and so do all cases where I use the "*"
                or "?" quantifier.
                What can the possible use be for something that matches anywhere in
                anything?

                Axel
              • flo.gehrke
                Message 7 of 11 , Sep 30, 2012
                  --- In ntb-scripts@yahoogroups.com, Axel Berger <Axel-Berger@...> wrote:
                  >
                  > "flo.gehrke" wrote:
                  > > You could use 'a*a' instead of 'a*'.
                  > I think in practice a star quantifier on its own is
                  > meaningless, there must be at least one other thing in the
                  > pattern...What can the possible use be for something that
                  > matches anywhere in anything?

                  In this context, I understood that 'a' is just an element that, "in practice", would primarily represent an element in a more complex pattern. In this respect, I agree with the objection you made.

                  In order to match just a sequence of literal 'a', a pattern like 'a*a' wouldn't make much sense, indeed. And 'a{1,}' or 'a+' would certainly be more appropriate solutions.

                  But to prevent any misunderstanding among beginners, we should stress that something like 'a*' is by no means always useless. Quite often, we have to express that an element 'a' may or may not be present.

                  For example: (?<=<xxx>)\d*(?=</xxx>) matching the position between '>' and '<' in strings like...

                  <xxx>12</xxx>
                  <xxx></xxx>
                  <xxx>9</xxx>

                  no matter if there is a number or no number.
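The lookaround pattern above can be tried with a short Python sketch. Python's `re` accepts this pattern because the lookbehind `<xxx>` has a fixed width; the three sample strings are the ones listed above.

```python
import re

pattern = re.compile(r"(?<=<xxx>)\d*(?=</xxx>)")
samples = ["<xxx>12</xxx>", "<xxx></xxx>", "<xxx>9</xxx>"]

# For each sample, record what was matched between the tags and where.
found = {}
for text in samples:
    m = pattern.search(text)
    found[text] = (m.group(), m.span())
    print(f"{text!r}: matched {m.group()!r} at {m.span()}")
```

In the second sample the match is empty but still successful, which is exactly the case where the star quantifier is needed.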

                  Regards,
                  Flo
                • John Shotsky
                  Message 8 of 11 , Sep 30, 2012
                    I agree, and use the star heavily in my clip libraries.

                    Regards,
                    John
                    RecipeTools Web Site: http://recipetools.gotdns.com/
