RE: [Clip] Greedy problem in regular expressions

  • Jamal Mazrui
    Message 1 of 10 , Jun 1, 2005
      Thanks, Don--I appreciate the tips and examples. Can you describe the
      approach you take generally? I infer that you make use of the ^ at the
      beginning of a line, the $ at the end, and a [^ ] type of expression as
      much as possible. In conceptual terms if possible, what enables these
      techniques to defeat the greedy problem?

      Jamal

      -----Original Message-----
      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On
      Behalf Of Don Daugherty
      Sent: Friday, May 27, 2005 10:49 AM
      To: ntb-clips@yahoogroups.com
      Subject: Re: [Clip] Greedy problem in regular expressions


      Jamal Mazrui wrote:

      >I have run into the "greedy problem" of NTB regular expressions enough
      >that I thought of asking for general tips about this. Since it is not
      >possible to tell NTB to be conservative rather than greedy in a match,
      >what are ways of working around the problem of more text than desired
      >being matched (especially problematic with the ^!Replace command)?
      >
      >
      I got in late on this item, but I didn't see any other responses that
      looked like what I do, so I thought I'd reply.
      If you've already dealt with your problem, that's great.

      To illustrate what I do, I'll take an example from a problem I saw
      earlier in this group. I believe someone wanted to locate the SECOND
      instance of ":" in any line and delete the rest of each line. To do that
      I'd use
      ^!Replace "^{[^:]*:[^:]*:}.*$" >> "\1" RWAS
      This uses the construction "[^:]*" as a wild card for any number of
      characters NOT equal to a ":"
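
For readers who want to try the idea outside NoteTab, here is a minimal Python sketch of the same second-colon replacement. It is not NoteTab clip code: Python groups with ( ) rather than { }, and the function name and sample text are made up for illustration. The class is written [^:\n] because Python's negated class would otherwise also match newlines, which the NoteTab help says its own negated class does not.

```python
import re

# Python stand-in for the NoteTab pattern above; ( ) replaces { } for grouping.
# [^:\n] keeps the class from running across line breaks in Python.
SECOND_COLON = re.compile(r"^([^:\n]*:[^:\n]*:).*$", re.MULTILINE)

# Greedy version for comparison: ".*" is free to swallow colons too.
GREEDY = re.compile(r"^(.*:.*:).*$", re.MULTILINE)


def keep_through_second_colon(text: str) -> str:
    """Delete everything after the second ':' on each line (hypothetical helper)."""
    return SECOND_COLON.sub(r"\1", text)


sample = "a:b:c:d\nno colons here\nx:y:z"
negated_result = keep_through_second_colon(sample)
greedy_result = GREEDY.sub(r"\1", "a:b:c:d")

print(negated_result)
print(greedy_result)
```

On the line "a:b:c:d" the negated-class version keeps "a:b:", while the greedy version keeps "a:b:c:" because ".*" runs on to the last colon it can reach.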

      As another example (rather extreme, maybe!), in an HTML file I needed
      to locate any line where the 17th column in any row in a table did not
      begin with one of seven characters: "A", "B", "C", "D", "E", "F" or "-",
      and insert an extra cell containing "-".

      The following command is what I used. (The command must be typed on one
      line but is wrapped here for email purposes.)

      LINE BEGINS HERE:
      ^!Replace "^{<tr\sAlign=\"center\">
      <td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td>
      <td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td>
      <td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td><td[^<]*</td>
      <td[^<]*</td><td[^<]*</td><td>}{[^ABCDEF-]}" >> "\1-</td><td>\2" RWAS
      LINE ENDS HERE
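
As a rough illustration only, the same replacement can be sketched in Python, where the repeated cell pattern can be generated instead of typed out. This is an assumption-laden translation, not NoteTab code: the count of 17 complete cells is taken from the pattern as posted above, the names CELL, ROW_PATTERN and pad_missing_cell are hypothetical, and ( ) replaces { } for grouping.

```python
import re

# One table cell whose content contains no "<" (same idea as <td[^<]*</td>).
CELL = r"<td[^<]*</td>"

# 17 complete cells, as counted in the posted pattern, then the opening tag
# of the next cell, whose first character must not be A-F or "-".
ROW_PATTERN = re.compile(
    r'^(<tr\sAlign="center">' + CELL * 17 + r"<td>)([^ABCDEF-])",
    re.MULTILINE,
)


def pad_missing_cell(html: str) -> str:
    """Insert an extra cell containing "-" before the offending cell."""
    return ROW_PATTERN.sub(r"\1-</td><td>\2", html)


prefix = '<tr Align="center">' + "<td>1</td>" * 17
bad_row = prefix + "<td>Zed</td></tr>"
good_row = prefix + "<td>A1</td></tr>"

print(pad_missing_cell(bad_row))
print(pad_missing_cell(good_row))
```

Because each cell is matched with [^<]* rather than .*, a cell can never swallow the tags of its neighbours, so the count of cells stays exact.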



      Fookes Software: http://www.fookes.us, http://www.fookes.com
      Fookes Software Mailing Lists: http://www.fookes.us/maillist.htm

    • Don Daugherty
      Message 2 of 10 , Jun 2, 2005
        Jamal Mazrui wrote:

        >Thanks, Don--I appreciate the tips and examples. Can you describe the
        >approach you take generally? I infer that you make use of the ^ at the
        >beginning of a line, the $ at the end, and a [^ ] type of expression as
        >much as possible. In conceptual terms if possible, what enables these
        >techniques to defeat the greedy problem?
        >
        >
        The technique involves using a wild card that cannot match the
        character I want to stop on, so the search stops at that character's
        first occurrence, thereby preventing the greediness from doing me
        in. If I want to stop on the first occurrence of, say, "x" in a given
        line, I search for "^[^x]*x". This means: starting at the beginning of the
        line (^), find any sequence of zero or more characters NOT equal to "x",
        followed by an "x". Thus in the following line:

        now is the time to get an x-ray of my teeth

        The search would select:
        now is the time to get an x
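
The same search can be checked with a short Python sketch (again an illustration, not NoteTab code). The second sample line, which contains two x's, is made up here to show where the greedy ".*" and the negated class part ways; the helper name first_match is hypothetical.

```python
import re

# Negated class: cannot step over an "x", so it stops at the first one.
STOP_AT_FIRST_X = re.compile(r"^[^x\n]*x")

# Greedy wild card: ".*" may cross an "x", so it ends at the last one.
GREEDY_TO_X = re.compile(r"^.*x")


def first_match(pattern: re.Pattern, line: str) -> str:
    """Return the matched text, or an empty string if nothing matches."""
    found = pattern.search(line)
    return found.group(0) if found else ""


line_one = "now is the time to get an x-ray of my teeth"
line_two = "an x-ray of the box"  # hypothetical line with two x's

print(first_match(STOP_AT_FIRST_X, line_one))
print(first_match(STOP_AT_FIRST_X, line_two))
print(first_match(GREEDY_TO_X, line_two))
```

With only one "x" in the line both patterns agree; with two, the greedy pattern selects the whole of "an x-ray of the box" while the negated class selects just "an x".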

        The following quote from the NoteTab help file (the regular one, not the
        one for Clip Programming) explains the use of the square brackets and
        the caret inside:

        A string enclosed in brackets [] specifies a character class. Any single
        character in the string is matched. For example, [abc] matches an a, b,
        or c. Ranges of ASCII letters and numbers can be abbreviated as, for
        example, [a-z0-9]. If the first symbol following the [ is a caret (^)
        then a negative character class is specified. In this case, the string
        matches all characters EXCEPT those enclosed in the brackets. For
        example, [^a-z] matches everything except lower case characters (and
        newlines).

        The use of ^ and $ outside of [] to stand for the line beginning and end
        is also discussed there.

        Hope this explanation helps. If not, feel free to ask more.