Thanks a million, I will try it out today.
I want to understand how the regex fits into the Xenu code. Doesn't Tilman's
regex require three parts:
1. the first part of the URL, minus the extension
2. the extension
3. the ? with parameters (optional)
Perhaps his code combines these three parts into the full URL, which is then
A new regex might require an update of the Xenu code. Tilman, is that
correct? (Or am I seeing things?)
Sent: Thursday, 7 October 2004 12:13
> using your regex from an earlier mail:
> Xenu crashes after a while.
There are different regex libraries, each with its own peculiarities in
The PowerGREP author claims that his regex flavour is fully Perl-compatible,
and Perl is considered a kind of industry standard for regex syntax.
I tested my regexp in PowerGREP and it catches all variants properly: URLs with
or without file extensions, URLs ending with a top-level domain or ending with GET
parameters like http://something.com?param_name=param_value
I suspect the regex library Tilman uses does not support every Perl feature.
So let's simplify our regex. It must use only basic regex syntax, so that we can
be sure Xenu can process it.
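1) What is \w ?
It matches a letter, a digit or an underscore.
So it is just shorthand for [_a-zA-Z0-9], which is the same as [_a-z0-9] when
matching is case-insensitive.
So you can replace "\w" with "[_a-z0-9]" (without quotes) inside my regex
(add A-Z to the class if matching is case-sensitive).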
2) As for \s ...
While \s matches not only spaces but also tabs and line breaks, I meant it
here to match spaces only.
So you can replace all occurrences of "\s" with " " (without quotes) inside
3) Remove the "(?:ftp|https?)://" part - we want relative URLs to be matched,
4) The regex library Tilman uses might not support non-capturing parentheses.
So we have to remove the "(?:\s*,[^,]+?\s*)*" part.
This part matches ,'something' repeated zero or more times.
So it matches: ",'something','something','something','something'"
OK for regexp - because this part of the match is optional, due to the presence of
the star after the closing parenthesis (the question mark right after the plus
only makes that repetition lazy).
So we decided to remove this part. Having said "A" we have to say "B": we
have to remove the "\s*\);" part as well. I hope you understand why - if
there is no longer a regex part to match the parameters, the ");" tail will never
match, causing the whole regexp to fail. We do not want that.
5) The regex library Tilman uses might not support lazy repetition. So we will
make our repetitions greedy. You know where to read about the meaning of these
So we repeat regex tokens not with "+?" but with "+".
Luckily this does not change what the regex matches.
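
The substitutions in steps 1, 2 and 5 can be checked with a short script. This is only a sketch in Python, whose "re" module is not the library Xenu uses, so it shows the idea rather than Xenu's behaviour. The sample string is made up for illustration and does not come from this thread.

```python
import re

# Hypothetical sample: a JavaScript call with a quoted URL (not from the thread).
sample = "open_win( 'page_1.html' )"

# Step 1: \w versus an explicit class. re.ASCII restricts \w to [A-Za-z0-9_];
# re.IGNORECASE lets [_a-z0-9] cover upper-case letters too.
shorthand = re.findall(r"\w+", sample, re.ASCII)
explicit = re.findall(r"[_a-z0-9]+", sample, re.IGNORECASE)

# Step 2: \s* versus a literal space followed by *.
with_s = re.search(r"\(\s*'", sample).group()
with_space = re.search(r"\( *'", sample).group()

# Step 5: lazy versus greedy repetition inside the quotes. The class [^'"]
# cannot run past the closing quote, so both forms match the same text.
lazy = re.search(r"""['"][^'"]+?['"]""", sample).group()
greedy = re.search(r"""['"][^'"]+['"]""", sample).group()

print(shorthand == explicit, with_s == with_space, lazy == greedy)
```

On this sample each pair gives identical results. The "\s" and " " forms differ only when the input contains tabs or line breaks between the tokens.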
Finally we get:
In plain English it is:
letters ranging from A to Z and/or digits and/or underscores, followed by
zero or more spaces, followed by an opening parenthesis, followed by zero or more
spaces, followed by a single or a double quote, followed by a sequence of
characters that are neither single nor double quotes, followed by a single or a
double quote.
We have trimmed the regex, and so we have made it less accurate. But if you
are sure you feed only valid HTML to Xenu, this is no problem.
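
To make the plain-English description concrete, here is a minimal sketch in Python. The pattern is my reconstruction from that description, not necessarily the exact regex posted, and the capturing group around the URL, the name extract_urls and the sample HTML are all assumptions added for illustration. Python's "re" module is also not the library Xenu uses.

```python
import re

# Assumed reconstruction of the simplified pattern from its plain-English
# description; the capturing group around the URL is added for illustration.
SIMPLIFIED = re.compile(r"""[_a-z0-9]+ *\( *['"]([^'"]+)['"]""", re.IGNORECASE)


def extract_urls(html):
    """Return the quoted first argument of each function-call-like match."""
    return [m.group(1) for m in SIMPLIFIED.finditer(html)]


# Hypothetical input, not taken from the thread.
html = """<a href="javascript:popup('docs/help.html','help');">Help</a>
<a href="#" onclick="openWin( 'http://something.com?param_name=param_value' )">x</a>"""
urls = extract_urls(html)
print(urls)
```

The loss of accuracy shows up here as well: the pattern accepts any call whose first argument is quoted, so a string such as alert('hi') would be reported as if 'hi' were a URL.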
> when i validate your regex with the regex tester found at
> www.forta.com, it nicely captured a URL such as
Nice to hear. It proves again that I have written my initial regex in a
> Let me cut this discussion short: did you test your regex with xenu 1.2g
No. I temporarily have no live internet access, only email.
As soon as it is fixed, I will test and post my results.