Greedy problem in regular expressions

  • Jamal Mazrui
    Message 1 of 10 , May 11, 2005
      I have run into the "greedy problem" of NTB regular expressions enough
      that I thought of asking for general tips about this. Since it is not
      possible to tell NTB to be conservative rather than greedy in a match,
      what are ways of working around the problem of more text than desired
      being matched (especially problematic with the ^!Replace command)?

      With other regular expression engines, including the RegExp COM object
      of the Windows Script Host, one can force a conservative match with a ?
      character following a * character. Since I have used this capability
      with other scripting languages, I am tempted to write a simple utility
      that an NTB clip can call to perform a search and replace with this
      syntax. In case others have already done something similar though, I'd
      like to examine those solutions first.
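A minimal sketch of the greedy versus conservative difference described above. It uses Python's `re` module rather than NTB or the Windows Script Host RegExp object (an assumption made only so the example can be run); the sample text is hypothetical.

```python
import re

# Hypothetical sample text with two tagged spans on one line.
text = "<strong>text 1</strong> <strong>text 2</strong>"

# Greedy: .* takes as much as it can, so the match runs to the LAST </strong>.
greedy = re.search(r"<strong>.*</strong>", text).group(0)

# Conservative: .*? takes as little as it can, so the match stops at the FIRST </strong>.
lazy = re.search(r"<strong>.*?</strong>", text).group(0)

print(greedy)
print(lazy)
```

The greedy pattern swallows both spans, while the `*?` form stops at the first closing tag, which is the behavior NTB regular expressions cannot be asked for directly.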

      I know I could install Perl or another 3rd party scripting package, but
      I'm hoping not to have to learn another scripting language or install a
      bunch of additional files in order to solve the greedy problem.

      Any suggestions?

      Jamal
    • Hugo Paulissen
      Message 2 of 10 , May 11, 2005
        > with other scripting languages, I am tempted to write a simple utility
        > that an NTB clip can call to perform a search and replace with this
        > syntax. In case others have already done something similar though, I'd
        > like to examine those solutions first.

        Jamal,

        Jody has developed some clips for use with Agent Ransack, but I
        cannot find them now.

        Here is his description from www.notetab.net ->

        "Agent Ransack
        --------------
        This is a nice little search utility for finding information in
        documents. Most people will be happy with NoteTab's Search Disk.
        However, there is a wizard to help you use regular expressions, so if
        you do not know how to use them you might find Ransack to help you
        get started. There is a Library that I made for it on the CD-ROM.

        (I actually made it smaller for the CD than it is. I have some
        special Clips to search bible text, but felt it was not appropriate
        for the CD. Please write Jody if you are interested in it at
        Support@...)

        Regards,

        Hugo
      • Alan C
        Message 3 of 10 , May 11, 2005
          Jamal Mazrui wrote:

          >I have run into the "greedy problem" of NTB regular expressions enough
          >that I thought of asking for general tips about this. Since it is not
          >possible to tell NTB to be conservative rather than greedy in a match,
          >what are ways of working around the problem of more text than desired
          >being matched (especially problematic with the ^!Replace command)?
          >
          >With other regular expression engines, including the RegExp COM object
          >of the Windows Script Host, one can force a conservative match with a ?
          >character following a * character.
          >
          Ditto in the Perl language. Are you aware of the ntb-scripts list and
          its archives?

          http://groups.yahoo.com/group/ntb-scripts/message/297

          If you don't abhor Perl (and I wouldn't blame you if you did), then there's
          at least a working solution if you follow that short thread to its end.

          >I know I could install Perl or another 3rd party scripting package, but
          >I'm hoping not to have to learn another scripting language or install a
          >bunch of additional files in order to solve the greedy problem.
          >
          >
          Oops. I didn't see that part until now.

          So it's likely that you'll want to go with an .exe as recommended by
          Hugo, etc.

          Alternatively the clip "find" (without the regex option) use it with
          loop to insert line break then follow up with regex replace.
        • acumming@cwnet.com
          Message 4 of 10 , May 12, 2005
            On Thu, 12 May 2005 08:46 , Jamal Mazrui <Jamal.Mazrui@...> sent:
            >Thanks for your response, Alan. You said:
            >"Alternatively the clip "find" (without the regex option) use it with
            >loop to insert line break then follow up with regex replace."
            >
            >Please elaborate on this technique.

            Well, this *may* depend upon what you want to do. But it can work in, for
            example, the following scenario.

            You want to change only the text that's between <strong> and </strong> tags
            in HTML, not the text outside the tags.

            ^!Jump DOC_START
            :loop4tags
            ^!Find </strong>
            ^!IfError end
            ^!Jump SELECT_END
            ^!Insert ^P
            ^!Goto loop4tags
            ;---end---

            <strong>text 1</strong> <strong> text 2</strong> <strong> text 3</strong>

            in your doc becomes:

            <strong>text 1</strong>
            <strong> text 2</strong>
            <strong> text 3</strong>

            The match is no longer greedy across tags, since NoteTab's regex tends to
            use the line break as its input record separator.

            Next run your regex, and the above becomes:

            <strong>modified_only_here 1</strong>
            <strong> modified_only_here 2</strong>
            <strong> modified_only_here 3</strong>

            ^!Replace "</strong>^P" >> "</strong>" WAIS

            That replace line then reverts your doc to its previous format (rids it of the
            formerly inserted line breaks).
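The three steps above (insert line breaks, run the regex, remove the line breaks) can be sketched as follows. This is Python rather than clip code, an assumption made so it can be run; the function name and replacement text are hypothetical. It relies on the fact that `.` in Python's `re` does not cross a line break by default, which plays the role of the line-based matching described above.

```python
import re

def replace_inside_strong(doc, new_text):
    """Replace the text inside every <strong>...</strong> pair using only a greedy pattern."""
    # Step 1: put a line break after each closing tag (the ^!Find loop above).
    broken = doc.replace("</strong>", "</strong>\n")
    # Step 2: greedy replace. "." does not match "\n", and each line now holds
    # at most one closing tag, so .* cannot run past it.
    changed = re.sub(
        r"<strong>.*</strong>",
        lambda m: "<strong>" + new_text + "</strong>",
        broken,
    )
    # Step 3: remove the line breaks that step 1 inserted.
    return changed.replace("</strong>\n", "</strong>")

doc = "<strong>text 1</strong> <strong> text 2</strong> <strong> text 3</strong>"
print(replace_inside_strong(doc, "modified_only_here"))
```

The sketch assumes each `<strong>` span sits on a single line; a span that already contains a line break would not be matched in step 2.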
            ---

            Also, you may search the web archives of this list. And, while at the web archives:

            Alec (Alec b if I'm not mistaken) has previously shared some impressive clip regexes
            that modify the input record separator so that it is no longer the default. THAT,
            this is your problem: the line break as the input record separator (default setting
            if has not been specified otherwise).

            There are limitations on specifying a different input record separator (I don't
            want to get your hopes up too high), but it is certainly worth a look.

            This issue is nearly, or could be, an FAQ. It comes up again and again on this list.

            Alan.

          • Jamal Mazrui
            Message 5 of 10 , May 12, 2005
              Thanks for your response, Alan. You said:
              "Alternatively the clip "find" (without the regex option) use it with
              loop to insert line break then follow up with regex replace."

              Please elaborate on this technique.

              Jamal
            • acummingsus
              Message 6 of 10 , May 12, 2005
                --- In ntb-clips@yahoogroups.com, <acumming@c...> wrote:
                > [ . . ] THAT,
                > this is your problem: the line break as the input record separator
                > (default setting if has not been specified otherwise).

                I mistakenly oversimplified there. Greed may still be involved and
                need to be dealt with too, along with the input record separator.

                I'm unsure how much can be done by changing the input record
                separator from a clip.

                Perl makes it so easy. And Perl integrates so well with NoteTab
                that I merely reverted to using Perl for these sorts of things. End
                of problem (for me).

                Alan.
              • Jamal Mazrui
                Message 7 of 10 , May 12, 2005
                  I appreciate your explanations. The coding to get around the greedy
                  problem seems daunting, however, so I will probably create an external
                  utility using PowerBasic and the "RegExp" object of the Windows Script
                  Host. I have actually made progress in this approach today, now having
                  a simple, compiled program that NTB calls via the ^!ShellWait command.
                  I would consider Perl, but find it easier to use regular expression
                  syntax and developer tools with which I'm already familiar based on
                  other Windows programming experience.

                  Regards,
                  Jamal

                  Fookes Software: http://www.fookes.us, http://www.fookes.com
                  Fookes Software Mailing Lists: http://www.fookes.us/maillist.htm

                • Don Daugherty
                  Message 8 of 10 , May 27, 2005
                    Jamal Mazrui wrote:

                    >I have run into the "greedy problem" of NTB regular expressions enough
                    >that I thought of asking for general tips about this. Since it is not
                    >possible to tell NTB to be conservative rather than greedy in a match,
                    >what are ways of working around the problem of more text than desired
                    >being matched (especially problematic with the ^!Replace command)?
                    >
                    >
                    I got in late on this item, but I didn't see any other responses that
                    looked like what I do, so I thought I'd reply.
                    If you've already dealt with your problem, that's great.

                    To illustrate what I do, I'll take an example from a problem I saw
                    earlier in this group. I believe someone wanted to locate the SECOND
                    instance of ":" in any line and delete the rest of each line. To do that
                    I'd use
                    ^!Replace "^{[^:]*:[^:]*:}.*$" >> "\1" RWAS
                    This uses the construction "[^:]*" as a wild card for any number of
                    characters NOT equal to a ":".
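A minimal sketch of the same idea in Python (an assumption for the sake of a runnable example; NoteTab writes the group with braces, Python with parentheses). The function name and sample lines are hypothetical.

```python
import re

def keep_through_second_colon(line):
    """Keep everything up to and including the second ':' and drop the rest of the line."""
    # [^:]* can never run past a colon, so each [^:]*: piece stops at the very
    # next colon no matter how greedy the * is.
    return re.sub(r"^([^:]*:[^:]*:).*$", r"\1", line)

print(keep_through_second_colon("name: value: trailing: junk"))
print(keep_through_second_colon("only one: colon here"))
```

A line with fewer than two colons does not match the pattern at all, so it is left unchanged.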

                    As another example (rather extreme, maybe!), in an HTML file I needed
                    to locate any line where the 17th column in any row in a table did not
                    begin with one of seven characters: "A", "B", "C", "D", "E", "F" or "—"
                    and insert an extra cell containing "—".

                    The following command is what I used. (The command must be typed on one
                    line but is wrapped here for email purposes.)

                    LINE BEGINS HERE:
                    ^!Replace "^{<tr\sAlign=\"center\">
                    <td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td>
                    <td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td>
                    <td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td>
                    <td[^<]*</td><td[^<]*</td><td>}{[^ABCDEF—]}" >> "\1—</td><td>\2" RWAS
                    LINE ENDS HERE
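The long command above repeats the same cell pattern by hand. As a rough sketch only, the same shape can be written in Python with a repetition count; Python, the function name, its parameters, and the plain hyphen used as the filler are all assumptions for illustration, not part of the clip. The default of 17 matches the number of times the cell pattern is repeated in the command as posted.

```python
import re

def insert_filler_cell(row, cells_before=17, allowed="ABCDEF-", filler="-"):
    """Insert a filler cell when the cell after `cells_before` cells starts with a character not in `allowed`."""
    # One complete cell: [^<]* cannot cross the next "<", so it stops at </td>.
    one_cell = r"<td[^<]*</td>"
    pattern = (
        r'^(<tr\sAlign="center">'
        + f"(?:{one_cell}){{{cells_before}}}"
        + r"<td>)([^" + re.escape(allowed) + r"])"
    )
    # Group 1 is the row up to the opening <td>; group 2 is the offending character.
    return re.sub(
        pattern,
        lambda m: f"{m.group(1)}{filler}</td><td>{m.group(2)}",
        row,
    )

# Hypothetical short row, checked after two cells instead of seventeen.
row = '<tr Align="center"><td>1</td><td>2</td><td>Zed</td></tr>'
print(insert_filler_cell(row, cells_before=2))
```

As in the clip, the pattern assumes cells contain no nested tags, since `[^<]*` stops at the first `<` it meets.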
                  • Jamal Mazrui
                    Message 9 of 10 , Jun 1, 2005
                      Thanks, Don--I appreciate the tips and examples. Can you describe the
                      approach you take generally? I infer that you make use of the ^ at the
                      beginning of a line, the $ at the end, and a [^ ] type of expression as
                      much as possible. In conceptual terms if possible, what enables these
                      techniques to defeat the greedy problem?

                      Jamal

                    • Don Daugherty
                      Message 10 of 10 , Jun 2, 2005
                        Jamal Mazrui wrote:

                        >Thanks, Don--I appreciate the tips and examples. Can you describe the
                        >approach you take generally? I infer that you make use of the ^ at the
                        >beginning of a line, the $ at the end, and a [^ ] type of expression as
                        >much as possible. In conceptual terms if possible, what enables these
                        >techniques to defeat the greedy problem?
                        >
                        >
                        The technique involves using a wild card that will stop the search at
                        its first occurrence, thereby preventing the greediness from doing me
                        in. If I want to stop on the first occurrence of, say, "x" in a given
                        line, I search for "^[^x]*x". This means: starting at the beginning of the
                        line (^), find any sequence of zero or more characters NOT equal to "x"
                        followed by an "x". Thus in the following line:

                        now is the time to get an x-ray of my teeth

                        The search would select:
                        now is the time to get an x
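The example above can be checked with a short Python sketch (Python is an assumption here, chosen only because it can be run; the pattern itself is the one given above). The second sample line, with two "x" characters, is hypothetical and is added to show where a plain greedy wild card would go wrong.

```python
import re

line = "now is the time to get an x-ray of my teeth"

# [^x]* cannot step over an "x", so the match ends at the first one.
first_x = re.search(r"^[^x]*x", line).group(0)
print(first_x)

# Hypothetical line with two "x" characters, to contrast with a greedy ".*".
two = "an x-ray and a xylophone"
negated = re.search(r"^[^x]*x", two).group(0)   # stops at the first "x"
greedy = re.search(r"^.*x", two).group(0)       # runs on to the last "x"
print(negated)
print(greedy)
```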

                        The following quote from the NoteTab help file (the regular one, not the
                        one for Clip Programming) explains the use of the square brackets and
                        the caret inside:

                        A string enclosed in brackets [] specifies a character class. Any single
                        character in the string is matched. For example, [abc] matches an a, b,
                        or c. Ranges of ASCII letters and numbers can be abbreviated as, for
                        example, [a-z0-9]. If the first symbol following the [ is a caret (^)
                        then a negative character class is specified. In this case, the string
                        matches all characters EXCEPT those enclosed in the brackets. For
                        example, [^a-z] matches everything except lower case characters (and
                        newlines).

                        The use of ^ and $ outside of [] to stand for the line beginning and end
                        is also discussed there.

                        Hope this explanation helps. If not, feel free to ask more.