301 Re: [jasspa] Greediness of regexp '+', '*' operators
Sep 13, 2000
We had a lot of discussion about this (myself + Steve) when
the new RE engine was developed. The old behaviour was the
minimal set as you pointed out below, which did appear to
be a little more logical (I had actually modified the search
to this behaviour years ago). However, it is very confusing when
you move to other packages where the search has the same syntax
and you get different results. For this reason it was more prudent
to conform with other packages, which basically means that
your RE must be unambiguous; hence for the search below then one
In fact the old shortened search actually used to fail more
often because it used to bail out earlier. One could specify an
RE whose match was quite clearly within a line and the search
would not find it because it never looked far enough (OK - I admit
the old search engine was flawed).
I would also point out that when you specify the shortened
RE you also sometimes do not get what you want. In the
same way that you are getting "too much" matching
below, with the shortened RE you sometimes do not
"get enough". So to be honest I think you have just
made the RE syntax a little bigger and now have 2 problems
instead of one !! (One also has to bear in mind that the
search engine is a real hairy piece of code and is not
to be messed with lightly).
So, I've kind of made up my mind the greedy RE is better -
you just have to be a little bit more specific as to what
you want. Steve's new RE engine is now real fast and works
a treat for incremental searches with '*'s and '+'s
present (used to be dead slow).
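
As a rough illustration of "being a little bit more specific", here is a
sketch using Python's re module rather than ME's own engine. The negated
character class [^>] is common to most RE packages; that ME treats it
identically is an assumption, and the helper name strip_font_open_tags is
made up for the example. The sample line is taken from the report quoted
below.

```python
import re

# One line from the HTML quoted in the original report.
LINE = ('<TD><NOBR><FONT FACE="Verdana, MS Sans Serif, Geneva" SIZE="-1">'
        '<B>Mixed Drinks/Liquor</B></FONT></NOBR></TD>')

# Greedy and ambiguous: ".+" runs on to the last ">" on the line.
GREEDY = re.compile(r'<FONT.+>')

# Still greedy, but specific: "[^>]+" cannot cross a ">", so the match
# has to end at the closing bracket of the FONT tag itself.
SPECIFIC = re.compile(r'<FONT[^>]+>')


def strip_font_open_tags(text):
    """Remove opening FONT tags only (hypothetical helper name)."""
    return SPECIFIC.sub('', text)


if __name__ == '__main__':
    print(GREEDY.search(LINE).group(0))    # everything from <FONT to </TD>
    print(SPECIFIC.search(LINE).group(0))  # just the opening FONT tag
    print(strip_font_open_tags(LINE))
```

The second pattern needs no non-greedy operator: the greedy '+' is still
taking as much as it can, but the character class leaves it nothing past
the first '>' to take.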
Well that's the end of my ramblings !!
Thomas Hundt wrote:
> When used in isearch-forward or query-replace-string regular expressions, the '+' and '*' quantifiers will match as many characters as possible, apparently stopping at a newline.
> For example, I wanted to remove the FONT tags in the html below, by doing a query-replace-string of "<FONT.+>" with "". But ME went and matched not what I wanted ("<FONT FACE="Verdana, MS Sans Serif, Geneva" SIZE="-1">") but the whole rest of the line, too: "<FONT FACE="Verdana, MS Sans Serif, Geneva" SIZE="-1"><B>Mixed Drinks/Liquor</B></FONT></NOBR></TD>". The "+" matched as many characters as possible. Some people call this "greediness".
> This is a problem not just in ME, but crops up in various places. One way of dealing with it (seen in TCL and Perl) is a "?" qualifier used after the "*" or "+" to tell it to act in non-greedy fashion, i.e., to match as few characters as possible. I think it would be nice if ME had something like this.
> [example html code]
> <TD><NOBR><FONT FACE="Verdana, MS Sans Serif, Geneva" SIZE="-1"><B>Mixed Drinks/Liquor</B></FONT></NOBR></TD>
> <TD><NOBR><FONT FACE="Verdana, MS Sans Serif, Geneva" SIZE="-1"><B>Wine</B></FONT></NOBR></TD>
> <TD><NOBR><FONT FACE="Verdana, MS Sans Serif, Geneva" SIZE="-1"><B>Beer</B></FONT></NOBR></TD>
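
For comparison, the Perl-style "?" qualifier mentioned above can be tried
out in Python's re module, which supports it. This is only a sketch of
how greedy and non-greedy matching differ on the first sample line; it
says nothing about what ME itself accepts, and first_match is a
hypothetical helper. The last pattern shows the "not enough" case raised
in the reply: with nothing after it to force a longer match, a minimal
'+' settles for a single character.

```python
import re

# First sample line from the HTML above.
LINE = ('<TD><NOBR><FONT FACE="Verdana, MS Sans Serif, Geneva" SIZE="-1">'
        '<B>Mixed Drinks/Liquor</B></FONT></NOBR></TD>')


def first_match(pattern, text):
    """Return the text of the first match, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


if __name__ == '__main__':
    # Greedy "+": as many characters as possible before a final ">".
    print(first_match(r'<FONT.+>', LINE))
    # Non-greedy "+?": as few characters as possible before a ">".
    print(first_match(r'<FONT.+?>', LINE))
    # Minimal matching with nothing following it: one character only.
    print(first_match(r'<B>.+?', LINE))
```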