
Re: [xenu-usergroup] how to properly crawl Internet Archive internal paths

  • Tilman Hausherr
    Message 1 of 3 , Sep 2, 2013
      The bug is caused by an incorrect "<base href=...>" tag. According to
      http://www.w3.org/TR/WD-html40-970708/struct/links.html#edef-BASE
      its href should be an absolute URL, but archive.org doesn't respect this.

      I have corrected Xenu so that it only uses a base whose URL starts with http or https. Get it here:
      http://home.snafu.de/tilman/tmp/xenubeta.zip
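
The rule described above can be sketched in Python. This is only an illustration of the idea, not Xenu's actual code: the function names and the `BAD_BASE` value are hypothetical (the thread does not quote the real base href served by archive.org), and `urllib.parse` stands in for whatever URL handling Xenu uses internally.

```python
from urllib.parse import urljoin, urlsplit

def effective_base(page_url, base_href):
    """Pick the URL that relative links are resolved against.

    The <base href> is used only if it is absolute with an http or
    https scheme; otherwise the page's own URL is used instead.
    """
    if base_href:
        scheme = urlsplit(base_href).scheme.lower()
        if scheme in ("http", "https"):
            return base_href
    return page_url

def resolve(page_url, base_href, link):
    """Resolve a link found on the page to an absolute URL."""
    return urljoin(effective_base(page_url, base_href), link)

PAGE = "http://web.archive.org/web/20121119162959/http://www.perewozkin.ru/"
# Hypothetical non-absolute base, assumed here for illustration only.
BAD_BASE = "/web/20121119162959/http://www.perewozkin.ru/"
LINK = "/web/20121119162959/http://www.perewozkin.ru/dostavka_gruzov"

print(resolve(PAGE, BAD_BASE, LINK))
```

With the non-absolute base ignored, the root-relative link is resolved against the page URL and so picks up the `web.archive.org` host.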

      Tilman

      On 02.09.2013 19:57, Tilman Hausherr wrote:
      Sorry for not answering your mail to me immediately.

      Anyway, I did open the XEN files, and I think there's a bug in Xenu: it does not process

      /web/20121119162959cs_/http://www.perewozkin.ru/templates/perewozkin/css/full_css.css?templates/perewozkin/

      correctly. Apparently it thinks that this is an absolute URL, which it isn't.
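
To illustrate the distinction, here is a small Python sketch of how a reference like the one above can be mistaken for an absolute URL. The thread does not say how Xenu performs this check, so the "naive" test below is purely an assumption for illustration; `is_absolute` is a hypothetical helper built on `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit

def is_absolute(ref):
    """A reference is absolute only if it starts with a scheme."""
    return bool(urlsplit(ref).scheme)

REF = ("/web/20121119162959cs_/http://www.perewozkin.ru/"
       "templates/perewozkin/css/full_css.css?templates/perewozkin/")

# Naive check (an assumption, not Xenu's code): "://" appears inside the
# path, so a substring search wrongly reports an absolute URL.
naive_says_absolute = "://" in REF

# Scheme-based check: the reference starts with "/", so it has no scheme
# and must be resolved against the page (or base) URL.
strict_says_absolute = is_absolute(REF)

print(naive_says_absolute, strict_says_absolute)
```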

      Although it makes no sense to verify archive.org sites (because you can't correct them), I'll research this anyway and hopefully find the bug and correct it. Be patient.

      Tilman

      On 02.09.2013 16:11, Melvin Solorio wrote:
      Hi, I am reporting a bug and have included a .XEN file, as requested on the official software page. I already sent you a message, but maybe it went to a wrong or nonexistent email address, because I never got an answer:

      This is not actually a bug, but rather unexpected behavior that I find quite confusing. Please explain the best way to access an internal page on the Internet Archive website through a relative path and verify it successfully. Xenu doesn't seem to handle the relative interlinking pattern properly (<a href="/web/20121119162959/http://www.perewozkin.ru/dostavka_gruzov"> instead of the full URL with the domain name, <a href="http://web.archive.org/web/20121119162959/http://www.perewozkin.ru/dostavka_gruzov">; please check the HTML source of my starting URL for further reference). I've tried the following methods, but Xenu returns either a "skip external" or a "not found" status for relative paths even though the pages actually exist:

      STARTING URL is:
      http://web.archive.org/web/20121119162959/http://www.perewozkin.ru/

      1) If the following wildcard is set to be INCLUDED in the link list:

      http://web.archive.org/web/*/perewozkin.ru*

      then a bunch of internal links beginning with '/web/' are identified as external in the verification results;



      2) whereas if the wildcards set to be INCLUDED are:

      http://web.archive.org/web/*/perewozkin.ru*
      */web/*/http*perewozkin.ru/*
      or even
      /web/*perewozkin.ru*


      then all the aforementioned links matching the wildcard are not verified and are marked as "not found". Apparently Xenu simply doesn't crawl them because the 'http://web.archive.org' part is missing. Why doesn't it insert the domain name automatically? You can see the .XEN files and exported .TXT files for both cases at:

      https://dl.dropboxusercontent.com/u/3644076/relative_path_as_external.XEN
      https://dl.dropboxusercontent.com/u/3644076/relative_path_as_broken.XEN

      https://dl.dropboxusercontent.com/u/3644076/relative_path_as_external.txt
      https://dl.dropboxusercontent.com/u/3644076/relative_path_as_broken.txt
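
As a rough Python sketch of what "inserting the domain name automatically" means, the root-relative href can be resolved against the starting URL before any include filter is applied. This makes several assumptions that the thread does not confirm: shell-style wildcard semantics (via `fnmatch`), which may differ from Xenu's own wildcard matching, and a hypothetical `PATTERN` that is not one of the wildcards listed above.

```python
from fnmatch import fnmatchcase
from urllib.parse import urljoin

START_URL = "http://web.archive.org/web/20121119162959/http://www.perewozkin.ru/"
HREF = "/web/20121119162959/http://www.perewozkin.ru/dostavka_gruzov"

# Hypothetical include pattern, assumed for illustration only.
PATTERN = "http://web.archive.org/web/*perewozkin.ru*"

# Resolving first gives the link the scheme and host of the page it was on.
resolved = urljoin(START_URL, HREF)

# The href as written in the HTML has no host, so it cannot match a
# pattern that begins with the host; the resolved URL can.
raw_matches = fnmatchcase(HREF, PATTERN)
resolved_matches = fnmatchcase(resolved, PATTERN)

print(resolved)
print(raw_matches, resolved_matches)
```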

      I'm currently using the Xenu version that allows wildcards. It might be possible to reach the pages by using just 'http://web.archive.org/' as the starting URL, but I need to scan the archived copies of certain websites only, and so I need some way to tell the program exactly which websites' pages it should find and verify. Please kindly consider helping me with this issue, and sorry for my poor English (I am from Russia).

      Regards,
      Marlen

