
crawling a large site

  • brandonmbyers01
    Message 1 of 5, Jun 24, 2009
      I've used Xenu for years, and it's an outstanding program. I've run into a problem, though: when it gets much beyond 500,000 total URLs (only 15-35% visited), my computer tells me the virtual memory is too full. I've got 1 GB of RAM, and the task manager claims ~2 GB are being used.

      I'd like to know what others have experienced when crawling a site of that size (or larger). Is it my hardware? Should I increase the virtual memory?
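      For a rough sense of scale, the figures in this post imply a few KB of crawl state per discovered URL. This is only a back-of-envelope sketch, not Xenu's actual memory accounting, and the 25% visited fraction below is an assumed midpoint of the 15-35% range:

```python
# Back-of-envelope estimate from the figures in the post above.
# Assumes the ~2 GB reported by Task Manager is mostly crawl state.
urls_seen = 500_000
mem_used_gb = 2.0

kb_per_url = mem_used_gb * 1024 * 1024 / urls_seen
print(f"~{kb_per_url:.1f} KB per discovered URL")

# If only 15-35% of URLs have been visited, the full site could be
# several times larger, and memory would grow accordingly.
visited_fraction = 0.25  # assumed midpoint of the 15-35% range
projected_urls = urls_seen / visited_fraction
projected_gb = projected_urls * kb_per_url / (1024 * 1024)
print(f"projected: ~{projected_urls:,.0f} URLs, ~{projected_gb:.0f} GB")
```

      On this arithmetic the finished crawl would need several times the RAM in the machine, which matches the virtual-memory warnings described above.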

      I tend to always run it 1 thread at a time, so that my computer (and the server) don't get too bogged down. I imagine if I ran it much higher, I'd have a bigger problem with giant sites ... right?

      Anyway, thanks to Tilman, and thanks to everyone who's helped with improvements to the software.
      - Brandon
    • Tilman Hausherr
      Message 2 of 5, Jun 24, 2009
        If you're using a version before 1.3b, update it - I made several
        changes to save memory.

        1 GB of RAM isn't much... Consider buying new RAM, it's really
        inexpensive. I bought an extra 2 GB of RAM recently for about 50
        Euros (maybe about 70 USD). I was lucky enough to have 4 slots in my PC,
        and two were empty :)

        Tilman

        On Wed, 24 Jun 2009 20:52:03 -0000, brandonmbyers01 wrote:

        >I've used Xenu for years, and it's an outstanding program. I've run into a problem, though: when it gets much beyond 500,000 total URL's (only 15-35% visited), my computer tells me the virtual memory is too full. I've got 1 GB of RAM, and the task manager claims ~ 2 GB are being used.
      • Thomas Fischer
        Message 3 of 5, Jun 25, 2009
          Hi Brandon,
           
          if Tilman's hints don't help, you might try to split your site into different sections and check them one at a time.
          You can use the "Do not check any URLs beginning with this:" option to do this, if the site is structured in any way.
           
          Cheers
          Thomas
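          One way to plan such a split is to bucket the site's URLs by path prefix and crawl one bucket per run, excluding the other prefixes with that option. A hypothetical helper sketch, not part of Xenu; the section prefixes and example URLs are made up:

```python
# Partition a flat list of URLs into sections by path prefix, so each
# section can be crawled in a separate Xenu run. The prefixes here are
# illustrative; substitute the real top-level directories of your site.
from urllib.parse import urlparse

def section_of(url, sections):
    """Return the first matching section prefix, or 'other'."""
    path = urlparse(url).path
    for prefix in sections:
        if path.startswith(prefix):
            return prefix
    return "other"

sections = ["/archive/", "/products/", "/blog/"]
urls = [
    "http://example.com/archive/2009/index.html",
    "http://example.com/blog/post1.html",
    "http://example.com/contact.html",
]

by_section = {}
for url in urls:
    by_section.setdefault(section_of(url, sections), []).append(url)

# Every prefix NOT in the section you are crawling becomes a
# "Do not check any URLs beginning with this:" entry in Xenu.
for name, members in by_section.items():
    print(name, len(members))
```

          Each run then stays well under the memory ceiling, at the cost of links between sections being reported as external.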



        • brandonmbyers01
          Message 4 of 5, Jun 25, 2009
            Thank you for the advice. I was running it on my home machine with 1 GB of RAM. At work, one of my machines has 4 GB. It looks like I'm running version 1.3 in all locations, so I'll try upgrading it.

            And the funny thing is I am excluding large sections of the site for this crawl, though perhaps I can exclude more.

            One reason I was running it on the 1 GB home machine is that after a while, it started timing out every ~250 URLs -- but only at work. I'm guessing if I compared the path of ISPs between work & the host to the ones between home & the host, the latter must have nicer peering agreements ... but I noticed the advice on this group yesterday to un-check the option to fail all URLs on a domain, so that should help me.

            Thanks for the insights, y'all ... if anyone else would like to discuss their experience crawling sites of this size, I'm curious to hear about it.
            -Brandon
          • brandonmbyers01
            Message 5 of 5, Jun 25, 2009
              Having upgraded to 1.3c, it runs for a while on the 4 GB machine, but soon shows an "Out of memory" warning. Since I'd turned off the "fail domain" option, it showed many of these warnings, and then Xenu abruptly disappeared.

              I restarted, and tried opening the file but hitting pause immediately. Then I tried sorting the listings, but nothing would happen. (I'd wait a while & come back to the computer, just in case.) I also tried exporting to a tab-delimited file, but nothing would happen. (Again, I waited.)

              I suppose I just need to subdivide the site further.

              Still, I can't complain; it's still the best program I've used for this, and all the other sites I've crawled have worked out fine (usually 180,000-230,000 URLs in those).

              -Brandon