Re: [xenu-usergroup] how to properly crawl Internet Archive internal paths
The bug is caused by an incorrect "<base href". According to
it should have an absolute URL, but archive.org doesn't respect this.
I have corrected Xenu so that it only uses bases that start with http or https; get it here:
On 02.09.2013 19:57, Tilman Hausherr wrote:
Sorry for not answering your mail to me immediately.
Anyway, I did open the XEN files, and I think that there's a bug in Xenu, that it does not process
correctly. Apparently it thinks that this is an absolute URL, which it isn't.
Although it makes no sense to verify archive.org sites (because you can't correct them), I'll research this anyway and hopefully find the bug and correct it. Be patient.
On 02.09.2013 16:11, Melvin Solorio wrote:
Hi, I am reporting a bug. I have included a .XEN file, as requested on the official software page, and sent you a message, but it may have gone to a wrong or nonexistent email address, because I never got an answer:
This is not actually a bug, but rather unexpected behavior that I find quite confusing. Please explain the best way to access an internal page on the Internet Archive website through a relative path and verify it successfully. Xenu doesn't seem to handle the relative interlinking pattern properly (<a href="/web/20121119162959/http://www.perewozkin.ru/dostavka_gruzov"> instead of the full URL with the domain name, <a href="http://web.archive.org/web/20121119162959/http://www.perewozkin.ru/dostavka_gruzov">; please check the HTML source of my starting URL for further reference). I've tried the following methods, but Xenu returns either a "skip external" or a "not found" status for relative paths even though the pages actually exist:
STARTING URL is:
1) if the following wildcards are set to be INCLUDED in the link list:
a number of internal links beginning with '/web/' are identified as external in the verification results;
2) whereas if the wildcards for links to be INCLUDED are:
*/web/*/http*perewozkin.ru/* or even
all the aforementioned links matching the wildcard are not seen or verified and are marked as "not found". Apparently Xenu simply doesn't crawl them because the 'http://web.archive.org' part is missing. Why doesn't it insert the domain name automatically? You can see the .XEN files and exported .TXT files for both cases at:
I'm currently using the Xenu version that allows wildcards. It might be possible to use just 'http://web.archive.org/' as the starting URL, but I need to scan the archived copies of certain websites only and somehow tell the program exactly which websites' pages it should find and verify. Please consider helping me with this issue, and sorry for my poor English (I am from Russia).
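The two symptoms described in cases 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats the include wildcard as shell-style globbing via `fnmatchcase`, which may differ from how Xenu matches wildcards, and `classify` and `START_HOST` are hypothetical names, not Xenu settings. The point it shows is that a link has to be made absolute before the internal/external decision, because an unresolved '/web/...' path carries no host at all.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Include wildcard quoted in case 2 of the report.
INCLUDE = "*/web/*/http*perewozkin.ru/*"
# Hypothetical: host of the starting URL, used to tell internal from external.
START_HOST = "web.archive.org"

def classify(url):
    """Decide what a crawler would do with a link (illustrative only)."""
    if urlsplit(url).netloc != START_HOST:
        return "skip external"
    return "check" if fnmatchcase(url, INCLUDE) else "skip"

relative = "/web/20121119162959/http://www.perewozkin.ru/dostavka_gruzov"
absolute = "http://web.archive.org" + relative

# The wildcard itself matches both forms of the link...
print(fnmatchcase(relative, INCLUDE), fnmatchcase(absolute, INCLUDE))
# ...but only the absolute form carries a host to compare against.
print(classify(relative))
print(classify(absolute))
```

In this sketch the wildcard is not the problem: the same pattern accepts the link once the 'http://web.archive.org' prefix has been added during resolution.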