Heritrix Performance

  • alxartes
    Message 1 of 3 , Feb 7, 2006
      Hi,

      I hope a generous person can point me in the right direction.

      I have already done a number of crawls using Heritrix. I want to know
      whether 6KB/s is an acceptable bandwidth. We have a tight schedule and
      want to crawl as fast as possible.

      Also, I understand that Heritrix limits a crawl to one URI per host
      at a time for politeness. Is it possible to circumvent this, since we
      have the permission of the site owners to crawl their websites?

      Apart from the above, what could be the reason for the slow crawling?
      The following is the output of top on Unix.

      Cpu(s): 0.3% us, 0.2% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.2% hi, 0.0% si
      Mem: 1035664k total, 945060k used, 90604k free, 91592k buffers
      Swap: 2040244k total, 128092k used, 1912152k free, 252684k cached

      Thanks.
    • stack
      Message 2 of 3 , Feb 7, 2006
        alxartes wrote:
        > Hi,
        >
        > I hope a generous person can point me in the right direction.
        >
        > I have already done a number of crawls using Heritrix. I want to know
        > whether 6KB/s is an acceptable bandwidth. We have a tight schedule and
        > want to crawl as fast as possible.
        6KB/s seems low. How many threads are you running? Are they all
        occupied all the time? Has the crawl just started, or is the 6KB/s a
        measurement taken after crawling for a while? What kind of hardware
        are you on, and what is the theoretical upper bound on the pipe you
        are using?

        On fairly basic hardware with 200 threads, minutes after startup doing a
        broad crawl, a megabyte per second and higher is not unusual.

        >
        > Also, I understand that Heritrix limits a crawl to one URI per host
        > at a time for politeness. Is it possible to circumvent this, since we
        > have the permission of the site owners to crawl their websites?
        >
        Not currently. The model of one-URI-per-host-at-a-time is fairly deeply
        ingrained.
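
To make the one-URI-per-host-at-a-time model concrete, here is a minimal sketch of a frontier that keeps one queue per host and hands out a URI only when that host has nothing in flight and its delay has elapsed. This is not Heritrix code: the class `PoliteFrontier` and its methods are hypothetical names invented for illustration, and the only value taken from this thread is the 2000 ms default delay.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Toy frontier (hypothetical, not Heritrix's implementation)."""

    def __init__(self, min_delay_ms=2000):
        self.min_delay = min_delay_ms / 1000.0  # seconds between fetches per host
        self.queues = defaultdict(deque)        # host -> pending URIs
        self.ready_at = {}                      # host -> earliest time of next fetch
        self.busy = set()                       # hosts with a fetch in flight

    def add(self, uri):
        host = urlsplit(uri).hostname
        self.queues[host].append(uri)
        self.ready_at.setdefault(host, 0.0)

    def next_uri(self, now):
        """Return a URI whose host is idle and past its delay, else None."""
        for host, queue in self.queues.items():
            if queue and host not in self.busy and self.ready_at[host] <= now:
                self.busy.add(host)
                return queue.popleft()
        return None

    def finished(self, uri, now):
        """Mark the fetch as done and start the host's politeness delay."""
        host = urlsplit(uri).hostname
        self.busy.discard(host)
        self.ready_at[host] = now + self.min_delay


frontier = PoliteFrontier()
frontier.add("http://example.org/a")
frontier.add("http://example.org/b")
frontier.add("http://example.net/x")

first = frontier.next_uri(now=0.0)   # example.org/a
second = frontier.next_uri(now=0.0)  # example.net/x, because example.org is busy
third = frontier.next_uri(now=0.0)   # None: every host with work is busy
frontier.finished(first, now=0.5)
fourth = frontier.next_uri(now=1.0)  # None: example.org is still inside its delay
fifth = frontier.next_uri(now=2.5)   # example.org/b
```

With a single host in the frontier, extra worker threads have nothing to pick up, no matter how many are configured.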

        > Apart from the above, what could be the reason for the slow crawling?
        > The following is the output of top on Unix.
        >
        > Cpu(s): 0.3% us, 0.2% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.2% hi, 0.0% si
        > Mem: 1035664k total, 945060k used, 90604k free, 91592k buffers
        > Swap: 2040244k total, 128092k used, 1912152k free, 252684k cached
        >
        > Thanks.

        Your top output seems to show your crawler idle most of the time (if
        I'm reading it correctly). Study the crawler reports, particularly the
        thread reports. Are the threads all just waiting all the time? How many
        hosts are you crawling? One? If so, only one thread will be active
        against that host. Have you played with the min-delay-ms setting,
        changing it from its default of 2000?

        St.Ack

      • alxartes
        Message 3 of 3 , Feb 9, 2006
          Thank you very much, Stack.

          Usually, I only do domain crawls of one to three seeds at a time. I
          guess that is the real reason for the low bandwidth.
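
A rough back-of-the-envelope estimate shows why so few hosts cap the bandwidth. The sketch below assumes each host serves one URI at a time and that the crawler waits the 2000 ms `min-delay-ms` default between fetches. The average document size (15 KB) and fetch time (0.5 s) are made-up illustrative figures, not measurements from this crawl, and the real delay calculation in Heritrix may involve other settings.

```python
def max_throughput_kb_per_s(hosts, avg_doc_kb, fetch_s, min_delay_ms=2000):
    """Upper bound on crawl rate when each host serves one URI at a time."""
    # One cycle per host: fetch a document, then wait out the politeness delay.
    cycle_s = fetch_s + min_delay_ms / 1000.0
    return hosts * avg_doc_kb / cycle_s


# Assumed figures: 15 KB average document, 0.5 s per fetch.
one_host = max_throughput_kb_per_s(hosts=1, avg_doc_kb=15, fetch_s=0.5)
three_hosts = max_throughput_kb_per_s(hosts=3, avg_doc_kb=15, fetch_s=0.5)
broad = max_throughput_kb_per_s(hosts=200, avg_doc_kb=15, fetch_s=0.5)

print(one_host, three_hosts, broad)  # 6.0 18.0 1200.0
```

Under these assumptions a single host tops out at about 6 KB/s, while a broad crawl that keeps 200 threads busy on 200 different hosts reaches roughly a megabyte per second. In this model the rate scales with the number of hosts being crawled, not with the thread count.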
