Re: [archive-crawler] BDB, state, and disk usage, and other questions

  • stack
    Message 1 of 9 , May 23, 2005
      Tom Emerson wrote:

      > I have a largish (2308 seeds, inferred SURT scope) crawl that has been
      > running for 115 hours under 1.4.0. So far I've discovered 9,870,579
      > documents and crawled 3,326,501. This consumes approximately 17 GB on
      > disk for the ARC files, and 34 GB on disk for the BDB state. It looks,
      > from the ad hoc monitoring that I've been doing on disk usage, that
      > there is a roughly linear 2:1 ratio in state storage to ARC
      > storage. Does anyone have a feel for whether this is the expected ratio? We
      > will be doing many more, larger crawls than this, and need to budget
      > disk space appropriately. (Note that I'm only crawling/storing
      > text/html, so I get good compression in the arcs (so far around 4:1).)

      Igor just took a look at a recent crawl and found that state was 15% the
      size of the arcs captured.

      I just tried a crawl against the infiniteurl application (infiniteurl is
      a simple webapp that we use here for testing. It manufactures new URLs
      ad infinitum, returning simple, uniform html pages all of the same
      approximate size). I'm seeing ratios of 50 to 1 -- the state takes 50
      times the space of the arc data saved. My crawl scenario is highly
      artificial, though.

      Seems like it depends highly on the character of the crawl being run
      (Capturing HTML only, Tom, you're throwing out a lot of what usually
      bulks up the arcs: images, pdfs, etc.).

      >
      > Also, while I have configured a maximum of 100 toe threads, at this
      > point only 92 are active (i.e., the dashboard states "91 of
      > 92"). About 10% of the seeds haven't even been processed yet: I take
      > it this doesn't happen until the other queues have been completely
      > exhausted?


      You have hold-queues enabled? Then this is what I'd expect (You might
      change the 'balance-replenish-amount' setting so it's less than 3000 so
      other queues get rotated in more quickly).

      Igor asks if the uncrawled seeds are showing in the seeds report -- are
      they recognized as uncrawled (There's an issue where we recognize seeds
      only if they have a scheme OR they look like a host name -- otherwise,
      they're skipped).

      >
      > The threads report states that there are 92 in the pool, but lists
      > status for 100. How can I find the 8 that aren't being used?
      >
      We ain't sure why it would report any less than 100 threads in the
      pool. Anything in heritrix_out.log? Runtime/local errors? Send us over
      the report so we can take a look.

      Thanks Tom,
      St.Ack

      > TIA,
      >
      > -tree
      >
      > --
      > Tom Emerson Basis Technology
      > Corp.
      > Software Architect
      > http://www.basistech.com
      > "Beware the lollipop of mediocrity: lick it once and you suck forever"
      >
      > ------------------------------------------------------------------------
      > *Yahoo! Groups Links*
      >
      > * To visit your group on the web, go to:
      > http://groups.yahoo.com/group/archive-crawler/
      >
      > * To unsubscribe from this group, send an email to:
      > archive-crawler-unsubscribe@yahoogroups.com
      > <mailto:archive-crawler-unsubscribe@yahoogroups.com?subject=Unsubscribe>
      >
      > * Your use of Yahoo! Groups is subject to the Yahoo! Terms of
      > Service <http://docs.yahoo.com/info/terms/>.
      >
      >
    • Tom Emerson
      Message 2 of 9 , May 24, 2005
        stack writes:
        > Seems like it depends highly on the character of the crawl being run
        > (Capturing HTML only, Tom, you're throwing out a lot of what usually
        > bulks up the arcs: images, pdfs, etc.).

        I just checked and I'm still around 2:1 state-vs-arcs on my HTML only
        crawl. It may be worth putting a discussion of this into the docs
        somewhere.

        > You have hold-queues enabled? Then this is what I'd expect (You might
        > change the 'balance-replenish-amount' setting so it's less than 3000 so
        > other queues get rotated in more quickly).

        Yes, hold-queues is enabled.

        Is there a description of the balance-replenish-amount,
        queue-total-budget, and cost-policy settings outside of the source
        code?

        > Igor asks if the uncrawled seeds are showing in the seeds report -- are
        > they recognized as uncrawled (There's an issue where we recognize seeds
        > only if they have a scheme OR they look like a host name -- otherwise,
        > they're skipped).

        Yes, they are showing up as uncrawled. All the URLs were extracted
        from the DMOZ catalog and are in canonical form.

        > > The threads report states that there are 92 in the pool, but lists
        > > status for 100. How can I find the 8 that aren't being used?
        > >
        > We ain't sure why it would report any less than 100 threads in the
        > pool. Anything in heritrix_out.log? Runtime/local errors? Send us over
        > the report so we can take a look.

        It's down to "85 of 89" threads now.

        There have been no alerts. However, there are 11 NPEs in the log:

        java.lang.NullPointerException
        at org.archive.util.CachedBdbMap$SoftEntry.clearPhantom(CachedBdbMap.java:508)
        at org.archive.util.CachedBdbMap.expungeStaleEntry(CachedBdbMap.java:456)
        at org.archive.util.CachedBdbMap.get(CachedBdbMap.java:343)
        at org.archive.crawler.frontier.BdbFrontier.next(BdbFrontier.java:514)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:135)

        These would account for the 11 missing threads.

        Thanks.

        -tree

        P.S. Status --- 3,943,333 of 11,406,315 documents downloaded. 46,389 queues.

        --
        Tom Emerson Basis Technology Corp.
        Software Architect http://www.basistech.com
        "Beware the lollipop of mediocrity: lick it once and you suck forever"
      • Gordon Mohr
        Message 3 of 9 , May 24, 2005
          Tom Emerson wrote:
          > stack writes:
          >
          >>Seems like it depends highly on the character of the crawl being run
          >>(Capturing HTML only, Tom, you're throwing out a lot of what usually
          >>bulks up the arcs: images, pdfs, etc.).
          >
          >
          > I just checked and I'm still around 2:1 state-vs-arcs on my HTML only
          > crawl. It may be worth putting a discussion of this into the docs
          > somewhere.

          There are BDB parameters which control how aggressively it consolidates
          and discards older, no longer full JDB files. See especially
          "je.cleaner.minUtilization". At a cost of more CPU and IO, setting
          more aggressive cleaning would cut back on disk footprint.
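
          Just as an illustration of how that knob can be turned (this is a
          sketch against the standard com.sleepycat.je API, with 65 as an
          arbitrary example value; the same property can also be set in a
          je.properties file in the environment directory):

          import java.io.File;
          import com.sleepycat.je.DatabaseException;
          import com.sleepycat.je.Environment;
          import com.sleepycat.je.EnvironmentConfig;

          // Illustrative only: open a JE environment whose cleaner reclaims
          // older .jdb log files sooner.
          public class AggressiveCleanerExample {
              public static Environment open(File stateDir) throws DatabaseException {
                  EnvironmentConfig cfg = new EnvironmentConfig();
                  cfg.setAllowCreate(true);
                  // JE's default je.cleaner.minUtilization is 50; raising it trades
                  // extra CPU and IO for a smaller on-disk footprint.
                  cfg.setConfigParam("je.cleaner.minUtilization", "65");
                  return new Environment(stateDir, cfg);
              }
          }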

          I suppose it's also possible that the current CPU-intensity of the
          crawler is causing the BDB cleaner to get behind, but I haven't looked
          into testing this hypothesis. It's also possible that something about our
          serialization is too expansive -- serializing out stuff into the
          queues that doesn't need to be there, perhaps because it could be
          costlessly reconstructed later.

          (Previously discussed a bit here:
          http://groups.yahoo.com/group/archive-crawler/message/1758 )

          >>You have hold-queues enabled? Then this is what I'd expect (You might
          >>change the 'balance-replenish-amount' setting so it's less than 3000 so
          >>other queues get rotated in more quickly).
          >
          >
          > Yes, hold-queues is enabled.
          >
          > Is there a description of the balance-replenish-amount,
          > queue-total-budget, and cost-policy settings outside of the source
          > code?

          Until these details get integrated into the User Manual,
          you can learn about these "budgeting" settings in the wiki:

          http://crawler.archive.org/cgi-bin/wiki.pl?BudgetingFrontier

          Any non-zero cost policy plus smallish refresh-budget will
          ensure that, in the 'hold-queues' case, after a queue is active
          for a while, it gets put to the back of the inactive queues, and
          another waiting queue gets a chance to be active for a while.
          (If in a long running crawl you see seeds on queues that have
          never been activated, you probably have either a zero cost
          policy or a very-large/infinite refresh budget.)
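
          Very roughly, the bookkeeping behaves like the sketch below
          (illustrative pseudo-code only -- these are not the real Frontier or
          WorkQueue classes or field names):

          // Illustrative sketch of per-queue budgeting; names are made up.
          class QueueBudgetSketch {
              int balance = 0;           // spendable credit while the queue is active
              int totalExpenditure = 0;  // lifetime spend, checked against the total budget

              // On activation the queue's balance is topped up.
              void activate(int balanceReplenishAmount) {
                  balance = balanceReplenishAmount;   // e.g. the default 3000
              }

              // Each attempted URI is charged according to the cost-policy.
              // Returns true when the queue should go to the back of the
              // inactive queues so another waiting queue gets a turn.
              boolean charge(int uriCost) {
                  balance -= uriCost;
                  totalExpenditure += uriCost;
                  return balance <= 0;
              }
          }

          With a zero cost the balance never drains, which matches the
          "queue never rotates out" symptom described above.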

          >>>The threads report states that there are 92 in the pool, but lists
          >>>status for 100. How can I find the 8 that aren't being used?
          >>>
          >>
          >>We ain't sure why it would report any less than 100 threads in the
          >>pool. Anything in heritrix_out.log? Runtime/local errors? Send us over
          >>the report so we can take a look.
          >
          >
          > It's down to "85 of 89" threads now.
          >
          > There have been no alerts. However, there are 11 NPEs in the log:
          >
          > java.lang.NullPointerException
          > at org.archive.util.CachedBdbMap$SoftEntry.clearPhantom(CachedBdbMap.java:508)
          > at org.archive.util.CachedBdbMap.expungeStaleEntry(CachedBdbMap.java:456)
          > at org.archive.util.CachedBdbMap.get(CachedBdbMap.java:343)
          > at org.archive.crawler.frontier.BdbFrontier.next(BdbFrontier.java:514)
          > at org.archive.crawler.framework.ToeThread.run(ToeThread.java:135)
          >
          > These would account for the 11 missing threads.

          Any deviation of the second/total number from the configured number
          of ToeThreads is a definite bug -- it looks like these NPEs are to
          blame.

          With enough 'ready' queues of material to crawl, the first number,
          the count of active threads, should ideally be equal to, or only
          momentarily less than, the total number of threads. However,
          especially on faster machines and the latest JVMs/OS threading,
          we've been seeing more of a gap here recently. I suspect thread
          scheduling may not be as 'fair' as it used to be, so some threads
          waiting to get an item are overtaken by others. As long as the CPU
          is saturated, though, forcing more fairness would just slow all
          threads equally, so this may not be a major concern, and the right
          adaptation could be to shrink the number of pool threads.

          - Gordon @ IA
        • Gordon Mohr
          Message 4 of 9 , May 25, 2005
            I wrote yesterday:
            > It's also possible that something about our
            > serialization is too expansive -- serializing out stuff into the
            > queues that doesn't need to be there, perhaps because it could be
            > costlessly reconstructed later.
            >
            > (Previously discussed a bit here:
            > http://groups.yahoo.com/group/archive-crawler/message/1758 )

            Some quick investigation revealed no giant things being mistakenly
            serialized out, but revealed some opportunities for small changes that
            have cut the serialized size of CrawlURIs by 60% or more in short tests.
            (From averages of ~1100 bytes to ~420 bytes in some short test crawls
            generating ~20000 serialized CrawlURIs.)

            As serialized CrawlURIs in queues dominate our BDB contents -- both
            the in-memory cache, and on-disk files -- this could result in a
            noticeably smaller disk footprint (and effectively "larger" cache
            at the same memory cost).
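
            The flavor of change involved is roughly the following (a
            simplified sketch, not the actual CandidateURI code): keep the
            heavyweight parsed URI object transient and serialize only its
            string form, rebuilding the parsed object when the record is read
            back.

            import java.io.IOException;
            import java.io.ObjectInputStream;
            import java.io.ObjectOutputStream;
            import java.io.Serializable;

            // Simplified illustration of the space-saving idea; not the real class.
            class CompactUriExample implements Serializable {
                private static final long serialVersionUID = 1L;
                private transient java.net.URI parsed;  // stand-in for the parsed UURI

                CompactUriExample(java.net.URI parsed) { this.parsed = parsed; }

                private void writeObject(ObjectOutputStream out) throws IOException {
                    out.defaultWriteObject();
                    out.writeUTF(parsed.toString());    // write only the compact string
                }

                private void readObject(ObjectInputStream in)
                        throws IOException, ClassNotFoundException {
                    in.defaultReadObject();
                    parsed = java.net.URI.create(in.readUTF());  // rebuild on read
                }
            }

            Multiplied across every CrawlURI sitting in the queues, per-record
            savings of this kind add up quickly.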

            These changes have been committed to HEAD after some minimal testing,
            so those willing to risk the bleeding edge are invited to try them
            out.

            - Gordon @ IA
          • Christian Kohlschuetter
            Message 5 of 9 , May 30, 2005
              On Wednesday 25 May 2005 23:44, Gordon Mohr wrote:
              > Some quick investigation revealed no giant things being mistakenly
              > serialized out, but revealed some opportunities for small changes that
              > have cut the serialized size of CrawlURIs by 60% or more in short tests.
              > (From averages of ~1100 bytes to ~420 bytes in some short test crawls
              > generating ~20000 serialized CrawlURIs.)
              >
              > As serialized CrawlURIs in queues dominate our BDB contents -- both
              > the in-memory cache, and on-disk files -- this could result in a
              > noticeably smaller disk footprint (and effectively "larger" cache
              > at the same memory cost).
              >
              > These changes have been committed to HEAD after some minimal testing,
              > so those willing to risk the bleeding edge are invited to try them
              > out.
              >
              > - Gordon @ IA

              Hello Gordon,

              I think these changes have broken something.

              With the changes applied, I get URIExceptions (usually hidden behind BDB's
              RuntimeExceptionWrapper), with messages like
              - "Relative URI but no base: :http:/www.imagesjournal.com/issue02/reviews/mannoirs.htm"
              - "Invalid URL encoding" (happens if URI's last character is '%').

              Here is a partial stacktrace from my own "NewFrontier" implementation (it has
              the same problems as the BdbFrontier, but also shows the exception's cause):

              Caused by: org.apache.commons.httpclient.URIException: Invalid URL encoding
              at org.apache.commons.httpclient.URI.decode(URI.java:1768)
              at org.apache.commons.httpclient.URI.decode(URI.java:1724)
              at org.apache.commons.httpclient.URI.getURI(URI.java:3743)
              at org.archive.crawler.datamodel.CandidateURI.writeObject(CandidateURI.java:517)
              at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:585)
              at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:890)
              at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1333)
              at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1284)
              at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1073)
              at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:291)
              at de.kohlschuetter.collections.queues.BucketQueue.enqueue(BucketQueue.java:221)
              at org.archive.crawler.frontier.NewWorkQueue.insertItem(NewWorkQueue.java:53)
              at org.archive.crawler.frontier.WorkQueue.insert(WorkQueue.java:352)
              at org.archive.crawler.frontier.WorkQueue.enqueue(WorkQueue.java:122)
              ... 12 more

              I am not sure how/why "bad URIs" reached that point, but without your patch, I
              do not get any exceptions.
              --
              Christian Kohlschütter
              mailto: ck -at- NewsClub.de
            • Gordon Mohr
              Message 6 of 9 , May 31, 2005
                Thanks for the report. It seems that problem URIs that would be
                harmlessly ignored at a later step are now causing problems in
                the optimized serialization.

                I've been able to reproduce the "Relative URI but no base"
                exception on serialization, but not the "Invalid URL encoding"
                exception on deserialization, even trying a URI with trailing
                '%'. Can you suggest a URI or page context where this is
                triggered?

                I think the fix will be twofold: (1) ensure these bad URIs are
                discarded earlier, so they are never queued; (2) making the
                queueing more robust against any case where a bad URI slips
                through.
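
                As a sketch of what (2) could look like (illustrative only,
                not the committed fix, assuming the commons-httpclient URI
                API): fall back to the escaped form when the decoded form
                can't be produced, rather than letting the URIException abort
                the enqueue.

                import org.apache.commons.httpclient.URI;
                import org.apache.commons.httpclient.URIException;

                // Illustrative only: tolerate a URI whose decoded form can't be
                // produced, instead of letting the URIException kill the enqueue.
                final class SafeUriString {
                    static String of(URI uri) {
                        try {
                            return uri.getURI();        // throws URIException on bad %-encoding
                        } catch (URIException e) {
                            return uri.getEscapedURI(); // raw escaped form is always available
                        }
                    }
                }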

                - Gordon @ IA

                Christian Kohlschuetter wrote:
                > On Wednesday 25 May 2005 23:44, Gordon Mohr wrote:
                >
                >>Some quick investigation revealed no giant things being mistakenly
                >>serialized out, but revealed some opportunities for small changes that
                >>have cut the serialized size of CrawlURIs by 60% or more in short tests.
                >>(From averages of ~1100 bytes to ~420 bytes in some short test crawls
                >>generating ~20000 serialized CrawlURIs.)
                >>
                >>As serialized CrawlURIs in queues dominate our BDB contents -- both
                >>the in-memory cache, and on-disk files -- this could result in a
                >>noticeably smaller disk footprint (and effectively "larger" cache
                >>at the same memory cost).
                >>
                >>These changes have been committed to HEAD after some minimal testing,
                >>so those willing to risk the bleeding edge are invited to try them
                >>out.
                >>
                >>- Gordon @ IA
                >
                >
                > Hello Gordon,
                >
                > I think these changes have broken something.
                >
                > With the changes applied, I get URIExceptions (usually hidden behind BDB's
                > RuntimeExceptionWrapper), with messages like
                > - "Relative URI but no
                > base: :http:/www.imagesjournal.com/issue02/reviews/mannoirs.htm"
                > - "Invalid URL encoding" (happens if URI's last character is '%').
                >
                > Here is a partial stacktrace from my own "NewFrontier" implementation (it has
                > the same problems as the BdbFrontier, but also shows the exception's cause):
                >
                > Caused by: org.apache.commons.httpclient.URIException: Invalid URL encoding
                > at org.apache.commons.httpclient.URI.decode(URI.java:1768)
                > at org.apache.commons.httpclient.URI.decode(URI.java:1724)
                > at org.apache.commons.httpclient.URI.getURI(URI.java:3743)
                > at org.archive.crawler.datamodel.CandidateURI.writeObject(CandidateURI.java:517)
                > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
                > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                > at java.lang.reflect.Method.invoke(Method.java:585)
                > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:890)
                > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1333)
                > at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1284)
                > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1073)
                > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:291)
                > at de.kohlschuetter.collections.queues.BucketQueue.enqueue(BucketQueue.java:221)
                > at org.archive.crawler.frontier.NewWorkQueue.insertItem(NewWorkQueue.java:53)
                > at org.archive.crawler.frontier.WorkQueue.insert(WorkQueue.java:352)
                > at org.archive.crawler.frontier.WorkQueue.enqueue(WorkQueue.java:122)
                > ... 12 more
                >
                > I am not sure how/why "bad URIs" reached that point, but without your patch, I
                > do not get any exceptions.
              • Gordon Mohr
                Message 7 of 9 , May 31, 2005
                  I wrote:
                  > I've been able to reproduce the "Relative URI but no base"
                  > exception on serialization, but not the "Invalid URL encoding"
                  > exception on deserialization, even trying a URI with trailing
                  > '%'. Can you suggest a URI or page context where this is
                  > triggered?

                  Oops, reverse those, I am getting the "no base" exception on
                  *de*serialization, but haven't yet been able to reproduce the
                  "Invalid URL encoding" on *serialization*.

                  - Gordon @ IA
                • Christian Kohlschuetter
                  Message 8 of 9 , Jun 1, 2005
                    On Tuesday 31 May 2005 21:47, Gordon Mohr wrote:
                    > Thanks for the report. It seems that problem URIs that would be
                    > harmlessly ignored at a later step are now causing problems in
                    > the optimized serialization.
                    >
                    > I've been able to reproduce the "Relative URI but no base"
                    > exception on serialization, but not the "Invalid URL encoding"
                    > exception on deserialization, even trying a URI with trailing
                    > '%'. Can you suggest a URI or page context where this is
                    > triggered?

                    The error might only occur if the '%' is at the end of the query string (see
                    stacktrace).

                    > I think the fix will be twofold: (1) ensure these bad URIs are
                    > discarded earlier, so they are never queued; (2) making the
                    > queueing more robust against any case where a bad URI slips
                    > through.

                    I would consider passing invalid URIs to the queue
                    as a bug. We should concentrate on (1).

                    In this special case ('%'), I suggest modifying the URI parser instead -- we
                    are already very generous when parsing bad URIs (and my Firefox browser
                    happily accepts that URI, too).
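
                    For example (just a sketch of the idea, with made-up
                    names), a check for incomplete percent-escapes could be
                    applied wherever candidate URIs are vetted, either
                    rejecting the URI or re-encoding the stray '%':

                    import java.util.regex.Pattern;

                    // Sketch: a '%' not followed by two hex digits is an incomplete
                    // escape, e.g. the trailing '%' in "...&cat=%2FArtists%".
                    final class PercentEscapeCheck {
                        private static final Pattern BAD_ESCAPE =
                            Pattern.compile("%(?![0-9A-Fa-f]{2})");

                        static boolean hasIncompleteEscape(String uri) {
                            return BAD_ESCAPE.matcher(uri).find();
                        }

                        // One "generous" repair: encode the stray '%' itself as %25.
                        static String repair(String uri) {
                            return BAD_ESCAPE.matcher(uri).replaceAll("%25");
                        }
                    }

                    hasIncompleteEscape() flags the ECM URL in the stacktrace
                    below, and repair() re-encodes only the dangling '%',
                    leaving valid escapes such as %2F alone.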


                    All the best,

                    Christian


                    Stacktrace:
                    java.lang.RuntimeException: Could not enqueue http://www.ecmrecords.com/Catalogue/ECM/1700/1792.php?lvredir=712&cat=%2FArtists% -- message: Invalid URL encoding
                    at org.archive.crawler.frontier.WorkQueue.enqueue(WorkQueue.java:126)
                    at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:338)
                    at org.archive.crawler.frontier.WorkQueueFrontier.receive(WorkQueueFrontier.java:325)
                    at org.archive.crawler.util.NewUriUniqFilter.addNow(NewUriUniqFilter.java:90)
                    at org.archive.crawler.util.NewUriUniqFilter.add(NewUriUniqFilter.java:68)
                    at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:308)
                    at org.archive.crawler.frontier.AbstractFrontier.loadSeeds(AbstractFrontier.java:472)
                    at org.archive.crawler.frontier.WorkQueueFrontier.initialize(WorkQueueFrontier.java:230)
                    at org.archive.crawler.framework.CrawlController.setupCrawlModules(CrawlController.java:572)
                    at org.archive.crawler.framework.CrawlController.initialize(CrawlController.java:336)
                    at org.archive.crawler.admin.CrawlJobHandler.startNextJobInternal(CrawlJobHandler.java:1066)
                    at org.archive.crawler.admin.CrawlJobHandler$2.run(CrawlJobHandler.java:1032)
                    at java.lang.Thread.run(Thread.java:595)
                    Caused by: org.apache.commons.httpclient.URIException: Invalid URL encoding
                    at org.apache.commons.httpclient.URI.decode(URI.java:1768)
                    at org.apache.commons.httpclient.URI.decode(URI.java:1724)
                    at org.apache.commons.httpclient.URI.getURI(URI.java:3743)
                    at org.archive.crawler.datamodel.CandidateURI.writeObject(CandidateURI.java:517)
                    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:585)
                    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:890)
                    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1333)
                    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1284)
                    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1073)
                    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:291)
                    at de.kohlschuetter.collections.queues.BucketQueue.enqueue(BucketQueue.java:221)
                    at org.archive.crawler.frontier.NewWorkQueue.insertItem(NewWorkQueue.java:53)
                    at org.archive.crawler.frontier.WorkQueue.insert(WorkQueue.java:352)
                    at org.archive.crawler.frontier.WorkQueue.enqueue(WorkQueue.java:122)
                    ... 12 more
                    --
                    Christian Kohlschütter
                    mailto: ck -at- NewsClub.de