> xenu crashes after a while.
There are different regex libraries, each with its own peculiarities in
The PowerGREP author claims that his regex flavour is fully Perl-compatible.
And Perl is considered to be a kind of industry standard in regex syntax.
I tested my regexp in PowerGREP and it catches all variants properly: URLs with
or without file extensions, URLs ending with a top-level domain or ending with GET
parameters like http://something.com?param_name=param_value
I suspect the regex library Tilman uses does not support all Perl features.
So let's simplify our regex. It must use only basic regex syntax, so that we can
be sure Xenu can process it.
1) What is \w ?
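It catches either a letter or a digit or an underscore.
So, for case-insensitive matching of ASCII text, it is just a shorthand for
[_a-z0-9] (use [_a-zA-Z0-9] if the match is case-sensitive).
So you can replace "\w" with "[_a-z0-9]" (without quotes) inside my regex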
2) As for \s ...
While \s catches not only spaces but also tabs and line breaks, I meant it
here to catch spaces only.
So you can replace all occurrences of "\s" with " " (without quotes) inside
3) Remove the "(?:ftp|https?)://" part - we want relative URLs to be matched,
4) The regex library Tilman uses might not support non-capturing parentheses.
So we have to remove the "(?:\s*,[^,]+?\s*)*" part.
This part matches ,'something' repeated zero or more times.
So it matches: ",'something','something','something','something'"
OK for regexp - because this part of the match is optional - due to the presence of
the star after the closing parenthesis (the question mark right after the plus
only makes that repetition lazy).
So we decided to remove this part. Having said "A" we have to say "B". I
mean we have to remove the "\s*\);" part as well. I hope you understand why - if
there is no regex part left to match the parameters, the ");" tail will never
match, always making the whole regexp fail. We do not want that.
5) The regex library Tilman uses might not support lazy repetition. So we will
make our repetitions greedy. You know where to read about the meaning of these
So we repeat regex tokens not with "+?" but with "+".
Luckily, this does not change what the regex matches here.
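Steps 1 and 5 above can be checked with a small sketch. This uses Python's re module only as a stand-in, since the library Xenu uses is not named here, and the sample string is a made-up example, not taken from any real page. Note that in Python 3 \w also matches non-ASCII letters unless re.ASCII is given, which is why the flag is used below.

```python
import re

# Hypothetical JavaScript call of the kind discussed in this thread.
sample = "openWin( 'page_one.html' , 'popup' );"

# Step 1: \w versus the explicit character class.
# re.ASCII restricts \w to [a-zA-Z0-9_]; re.IGNORECASE lets [a-z] match A-Z too.
with_shorthand = re.findall(r"\w+", sample, re.ASCII)
with_class = re.findall(r"[_a-z0-9]+", sample, re.IGNORECASE)
same_words = with_shorthand == with_class

# Step 5: lazy versus greedy repetition of a negated class before a quote.
# The class [^'"] cannot run past the closing quote, so both forms stop there.
lazy = re.search(r"['\"][^'\"]+?['\"]", sample).group(0)
greedy = re.search(r"['\"][^'\"]+['\"]", sample).group(0)

print(same_words)      # True
print(lazy)            # 'page_one.html'
print(lazy == greedy)  # True
```

The lazy and greedy forms agree here only because the repeated token is a negated class that excludes the character following it. With a token like "." the two could match different amounts of text.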
Finally we get:
In plain English it is:
letters ranging from A to Z and/or digits and/or underscores, followed by
zero or more spaces, followed by an opening bracket, followed by zero or more
spaces, followed by a single or a double quote, followed by a sequence of
characters that are neither single nor double quotes, followed by a single or a
We have trimmed the regex, and so we have made it less accurate. But if you
are sure you feed only valid HTML to Xenu, this is no problem.
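As an illustration, here is a rough sketch of a pattern built only from the plain-English description above. It is a reconstruction under assumptions, not necessarily the exact regex from this message. It assumes case-insensitive matching and uses Python's re module as a stand-in for whatever library Xenu uses. The HTML fragment and the function name openWin are invented for the example.

```python
import re

# Reconstructed from the plain-English description only (an assumption):
# word characters, optional spaces, "(", optional spaces, a quote,
# one or more non-quote characters, a closing quote.
SIMPLIFIED = re.compile(r"[_a-z0-9]+ *\( *['\"][^'\"]+['\"]", re.IGNORECASE)

# Hypothetical HTML fragment; openWin is an invented function name.
html = """<a href="#" onclick="openWin( 'docs/page.html', 'popup' );">Docs</a>
<a href="#" onclick="alert('Hello')">Hi</a>"""

matches = SIMPLIFIED.findall(html)
for m in matches:
    print(m)
# openWin( 'docs/page.html'
# alert('Hello'
```

The second match shows the loss of accuracy mentioned above: without the parts that were removed, any function call whose first argument is a quoted string is caught, whether or not that string is a URL.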
> when i validate your regex with the regex tester found at
> www.forta.com, it nicely captured a URL such as
Nice to hear. It proves again that I have written my initial regex in a
> Let me cut this discussion short: did you test your regex with xenu 1.2g
No. I temporarily have no live internet access, only email.
As soon as it is fixed, I will test and post my results.
- Did anyone try to run Xenu with Sandboxie? Does it work? I'm thinking
about mentioning it on my web page so that paranoid people can use Xenu