
Re: [archive-crawler] Re: Large crawl experience (like, 500M links)

  • Gordon Mohr (archive.org)
    Message 1 of 12, Dec 19, 2005
      joehung302 wrote:
      > I did a proof crawling using BroadScope and 22K seeds. I got OOME
      > within a day. I then checkpoint it, restart the crawler, start
      > another crawl from the checkpoint, OOME within a day.
      >
      > I then changed to use 5K seeds and BroadScope, OOME within a day.
      > Restart with the checkpoint and still OOME within a day.
      >
      > I then run 5K seeds with DomainScope (kind of given up on
      > broadscope). OOME within a day.
      >
      > I have my JVM set to -Xmx1500m. BTW, I'm using 64 bit JDK1.5.
      >
      > One thing that I observed is, broad scope runs much faster than
      > domain scope under roughly the same condition. In both broadscope
      > runs I was able to top 1000KB/s bandwidth limit with around 50% cpu
      > usage. In the domain scope run I can only get to 500KB/s throughput
      > with 100% cpu busy.
      >
      > I used to be able to run 1.0.4 for a week with <1K seeds and get
      > around 1M links per day. I thought the bdb improvement should be
      > able to take more seeds and run longer. I really want the crawler to
      > run with a big seed list because we're going to seed my big crawl
      > with links from ODP.
      >
      > Any suggestions that I can try?

      We've run into problems under 64bit JVMs, and they seem mostly
      attributable to the fact that the JVM's object pointers are larger
      and thus the same object structures will take up more RAM.

      This post from a Sun engineer suggests a rule of thumb of a 40%
      larger heap to be comparable to a 32bit JVM heap:

      http://forum.java.sun.com/thread.jspa?threadID=671184
      (see reply #8)

      So your 1500m heap in a 64bit JVM may be roughly comparable to a
      1071m heap in a 32bit JVM.

      Further, as noted in the 1.6 release notes, BerkeleyDB-JE 2.0.90's
      internal mechanisms for staying within the budgeted cache size
      are inaccurate under 64bit JVMs, so rather than the default 60%
      cache size, 40% or even 30% would be safer.
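
      For reference, the knob involved is BDB-JE's cache percentage. Just to
      make the setting concrete, here is a minimal sketch at the raw
      com.sleepycat.je API level (Heritrix normally sets this via its own
      crawl-order settings rather than code like this; the state-directory
      path is a placeholder):

        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentConfig;
        import java.io.File;

        public class CachePercentExample {
            public static void main(String[] args) throws Exception {
                EnvironmentConfig cfg = new EnvironmentConfig();
                cfg.setAllowCreate(true);
                // Under a 64bit JVM, budget 30-40% of the heap for the JE
                // cache instead of the 60% default, since JE's internal
                // size accounting runs low there.
                cfg.setCachePercent(40);
                Environment env = new Environment(new File("/path/to/state-dir"), cfg);
                // ... open databases, run the crawl, etc. ...
                env.close();
            }
        }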

      Even with these adjustments, there are still a few structures in
      the frontier that slowly grow without bound in a broad crawl. We
      aim to constrain the last of these by the 1.8 release, leading to
      crawls that wobble (slow down) rather than ever falling down (OOME),
      as long as there's still disk space.

      BdbUriUniqFilter helps defer an OOME until those other structures
      become a problem, by not letting the URL already-seen structures
      grow without bound. However, it's pretty inefficient for this kind
      of set-membership testing, especially once the crawl is big/dispersed
      enough that the cache isn't helping much. (It gets very slow.)

      BloomUriUniqFilter offers another option: its speed doesn't degrade
      with the number of URIs crawled. However, this comes at the cost of
      a higher false-positive rate (misrecognizing a URI as already-seen
      when it hasn't been) -- and once the crawl gets larger than the size
      the Bloom filter was designed for, the false-positive rate grows to
      approach 100%. The default parameters use ~500MB to achieve a 1-in-
      4 million false-positive rate through 125 million URLs; these can
      be tuned via System properties. (See the BloomUriUniqFilter source
      and http://crawler.archive.org/cgi-bin/wiki.pl?BloomUriUniqFilter
      for details.)
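
      If you want to sanity-check numbers like those, the textbook Bloom
      filter formula is enough: with m bits, k hash functions and n inserted
      URIs, the expected false-positive rate is (1 - e^(-k*n/m))^k. A tiny
      sketch (the bit-array size and hash count here are illustrative
      stand-ins, not necessarily the exact BloomUriUniqFilter defaults):

        public class BloomMath {
            // Expected false-positive rate for m bits, k hashes, n items.
            static double falsePositiveRate(double m, double k, double n) {
                return Math.pow(1 - Math.exp(-k * n / m), k);
            }

            public static void main(String[] args) {
                double m = 500d * 1024 * 1024 * 8; // roughly 500MB worth of bits
                double k = 22;                     // hash functions (illustrative)
                System.out.println(falsePositiveRate(m, k, 125000000d)); // ~1e-7: rare
                System.out.println(falsePositiveRate(m, k, 500000000d)); // ~0.19: useless
            }
        }

      The second line is the 'grows to approach 100%' effect: push the same
      filter well past its design size and false positives stop being rare.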

      We've started work on another UriUniqFilter that uses a batch
      merging technique described in the 2001 "High-Performance Web
      Crawling" paper by Mark Najork and Allan Heydon, in section 3.2,
      "Efficient Duplicate URL Eliminators". A rough version is in CVS
      now but it will need more tuning to match or surpass the existing
      options. The hope is that it will offer adequate performance into
      the hundreds of millions of URIs without hitting the walls of the
      current options.
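
      For the curious, the core of the Najork/Heydon idea is simple even if an
      efficient implementation isn't: buffer fingerprints of candidate URIs in
      RAM, then periodically sort the batch and sweep it against a sorted
      on-disk store of everything already seen, keeping only the genuinely new
      ones. A toy sketch of that shape (not the code in CVS; a TreeSet stands
      in for the sorted disk file):

        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;
        import java.util.TreeSet;

        public class BatchDupElimSketch {
            private final TreeSet<Long> seenStore = new TreeSet<Long>(); // stand-in for sorted disk file
            private final List<Long> pendingBatch = new ArrayList<Long>();
            private static final int BATCH_SIZE = 100000;

            /** Queue a URI fingerprint; novelty is only decided at flush time. */
            public void add(long uriFingerprint) {
                pendingBatch.add(uriFingerprint);
                if (pendingBatch.size() >= BATCH_SIZE) {
                    flush();
                }
            }

            /** Sort the in-memory batch and merge it against the 'seen' store. */
            public List<Long> flush() {
                Collections.sort(pendingBatch);
                List<Long> novel = new ArrayList<Long>();
                Long previous = null;
                for (Long fp : pendingBatch) {
                    if (fp.equals(previous)) {
                        continue;               // duplicate within this batch
                    }
                    previous = fp;
                    if (seenStore.add(fp)) {    // one ordered pass over disk, in the real thing
                        novel.add(fp);          // genuinely new: hand to the frontier
                    }
                }
                pendingBatch.clear();
                return novel;
            }
        }

      The win is that the expensive already-seen checks happen in large sorted
      runs instead of one random disk probe per discovered URI.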

      Regarding the difference between DomainScope and BroadScope
      performance:

      All the 'classic' limited scopes -- DomainScope, HostScope,
      PathScope -- use an inefficient linear probe against all
      acceptable patterns (usually, all seeds) to test if a URI is
      in scope. So, with a large number of seeds, they're slow
      CPU hogs.

      SurtPrefixScope can do anything they can, and much more
      efficiently, so it's worth it to recast anything you were
      using DomainScope for to use SurtPrefixScope instead.
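
      The difference is essentially a linear scan versus a lookup in a sorted
      set of prefixes. A rough illustration, assuming URIs and patterns are
      already in SURT form and that the prefix set has been reduced so no entry
      is a prefix of another (made-up method names, not the actual
      SurtPrefixSet code):

        import java.util.List;
        import java.util.TreeSet;

        public class ScopeProbeSketch {
            // Classic scopes: probe the candidate against every seed-derived pattern.
            static boolean linearProbe(List<String> patterns, String surt) {
                for (String p : patterns) {          // O(number of seeds) per URI
                    if (surt.startsWith(p)) {
                        return true;
                    }
                }
                return false;
            }

            // SURT-prefix style: keep patterns sorted; only one neighbor matters.
            static boolean prefixProbe(TreeSet<String> patterns, String surt) {
                String candidate = patterns.floor(surt); // greatest pattern <= surt
                return candidate != null && surt.startsWith(candidate);
            }
        }

      With tens of thousands of seeds, the first version is what eats the CPU;
      the second is a single ordered lookup no matter how many prefixes there are.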

      --

      One other thing which should help a little with the BdbUriUniqFilter
      performance bottleneck is to use the 'queue budgeting' features
      so that the crawler concentrates on a specific queue (host) for a
      while, then rotates it out of activity to give other queues a chance.
      In the BdbFrontier expert settings, this means making sure the
      'cost-policy' is something other than ZeroCostAssignmentPolicy,
      and tending to make the 'balance-replenish-amount' larger rather
      than smaller. The current defaults for these are OK, but if you've
      changed them you may have decreased the potential for the BDB cache
      to benefit from site-locality patterns in discovered links.
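
      Conceptually the budgeting works like a per-queue allowance, roughly as
      below (a made-up sketch of the idea, not Heritrix's actual frontier
      classes): each active queue spends its balance as its URIs are
      processed, and when the balance runs out the queue is rotated to the
      back of the line.

        public class QueueBudgetSketch {
            private int balance;
            private final int replenishAmount;   // the 'balance-replenish-amount'

            QueueBudgetSketch(int replenishAmount) {
                this.replenishAmount = replenishAmount;
                this.balance = replenishAmount;
            }

            /** Charge this URI's cost; true means the queue should rotate out. */
            boolean chargeAndCheckExhausted(int uriCost) {  // cost comes from the cost-policy
                balance -= uriCost;
                return balance <= 0;
            }

            /** Called when the queue is rotated back into activity. */
            void replenish() {
                balance = replenishAmount;
            }
        }

      The larger the replenish amount, the longer the crawler dwells on one
      host at a time, which is what lets the BDB cache benefit from the
      site-locality mentioned above.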

      Hope this helps,

      - Gordon @ IA

      > --- In archive-crawler@yahoogroups.com, stack <stack@a...> wrote:
      >
      >> joehung302 wrote:
      >>
      >>>> Use the bloom filter option for the already-seen in BdbFrontier. Seems
      >>>> to work better when a machine goes above 30-50million. Bloom becomes
      >>>> saturated at 125million so thats about the upperbound per machine at
      >>>> the moment unless you up the bloom filter size (but its already big
      >>>> and you'll start eating into heap the crawler is using going about its
      >>>> other business). Thereafter the rate of false positives -- reports
      >>>> that we've seen an URL when in fact we haven't -- starts to increase
      >>>> (Read the BloomFilter javadoc for more on its workings).
      >>>
      >>> How confident do you guys feel that if I use broad-scope I can go
      >>> above 50M links (or even 100M links) without OOME on a single machine?
      >>
      >> I'd suggest you startup a proofing test crawl with BroadScope and see it
      >> does.
      >>
      >> On machines with specs like those listed below we've pulled down
      >> >50Million documents per instance with >125million discovered. Scope
      >> was not BroadScope. Once or twice we OOME'd but thought is that
      >> probable cause has been addressed in 1.6 release (If there is an OOME,
      >> you can checkpoint, restart and recover the crawl. Often it will
      >> continue the crawl as it avoids an exact replay of the circumstances
      >> that brought on the OOME).
      >>
      >> One thing I forgot to add to yesterday's list is regular checkpointing
      >> -- every 4 hours or so.
      >>
      >> St.Ack
      >>
      >> -bash-3.00$ uname -a
      >> Linux crawling015.archive.org 2.6.11-1.27_FC3smp #1 SMP Tue May 17
      >> 20:43:11 EDT 2005 i686 athlon i386 GNU/Linux
      >>
      >> -bash-3.00$ more /etc/issue
      >> Fedora Core release 3 (Heidelberg)
      >> Kernel \r on an \m
      >>
      >> Dual AMD Opteron(tm) Processor 246 w/ cpu MHz : 2009.374 and
      >> cache size : 1024 KB
      >>
      >> [crawling013 5] ~ > /lib/libc.so.6
      >> GNU C Library stable release version 2.3.4 (20050218), by Roland McGrath
      >> et al.
      >> Copyright (C) 2005 Free Software Foundation, Inc.
      >> This is free software; see the source for copying conditions.
      >> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
      >> PARTICULAR PURPOSE.
      >> Configured for i586-suse-linux.
      >> Compiled by GNU CC version 3.3.5 20050117 (prerelease) (SUSE Linux).
      >> Compiled on a Linux 2.6.9 system on 2005-06-10.
      >> Available extensions:
      >>  GNU libio by Per Bothner
      >>  crypt add-on version 2.1 by Michael Glad and others
      >>  linuxthreads-0.10 by Xavier Leroy
      >>  GNU Libidn by Simon Josefsson
      >>  NoVersion patch for broken glibc 2.0 binaries
      >>  BIND-8.2.3-T5B
      >>  libthread_db work sponsored by Alpha Processor Inc
      >>  NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
      >> Thread-local storage support included.
      >> For bug reporting instructions, please see:
      >> <http://www.gnu.org/software/libc/bugs.html>.
      >>
      >> We used sun 1.5.0:
      >>
      >> -bash-3.00$ /usr/local/jdk1.5.0_03/bin/java -version
      >> java version "1.5.0_03"
      >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_03-b07)
      >> Java HotSpot(TM) Server VM (build 1.5.0_03-b07, mixed mode)
      >>
      >>> That to me that seems to be the deciding factor on whether we should
      >>> start with 5 beefy machines and hope each one can go up to 100M links,
      >>> or with 10 less beefy machines and each one can go up to 50M links
      >>> without OOME.
      >>>
      >>> I know I'm shooting darts in the dark now...I have to start the
      >>> project planning soon so I'd like to take my best guess with all the
      >>> advices I can get.
      >>>
      >>> cheers,
      >>> -joe
    • joehung302
      Message 2 of 12, Dec 20, 2005
        Thanks a lot for the insight.

        I'll change to use 32bit JVM immediately.

        I'm using the BloomUriUniqFilter already.

        I'll do some reading on SurtPrefixScope and see if I can do
        something to get rid of domain scope.

        Will keep you guys posted (or, keep bugging you guys).

        cheers,
        -joe

      • joehung302
        Message 3 of 12, Dec 22, 2005
          Follow up questions on SurtPrefixScope:

          I'm confused about the relationship between seeds and
          SurtPrefixScope.

          Let's say I have a SurtScope for crawling .edu sites. something like
          this:

          +http://(edu,

          Then I supply a bunch of regular URLs in seeds.txt.

          The crawler starts, goes through the URLs in seeds.txt, filters out
          anything except *.edu, and starts crawling happily.

          Then I use the JMX client to add a bunch of new URLs to the crawler. I
          assume it will filter out non-edu sites again and schedule the new URLs
          to crawl.

          My question is, once those new URLs get downloaded:

          - How does Heritrix decide whether to download links extracted from
          the newly-added URLs?
          - How do the max-link-hops and max-trans-hops parameters in the
          SurtPrefixScope come into play w.r.t. those extracted links?

          Thanks,
          -joe


        • Gordon Mohr (archive.org)
          Message 4 of 12, Dec 22, 2005
            joehung302 wrote:
            > Follow up questions on SurtPrefixScope:
            >
            > I'm confused about the relationship between seeds and
            > SurtPrefixScope.
            >
            > Let's say I have a SurtScope for crawling .edu sites. something like
            > this:
            >
            > +http://(edu,
            >
            > Then I supply a bunch of regular URLs in seeds.txt.
            >
            > The crawler starts, go through URLs in seeds.txt, filter out
            > anything except *.edu. Start crawling happily.

            Yes.

            > Then I use JMX client to add a bunch of new URLs into the crawler. I
            > assume it will filter out non-edu sites again and schedule new URLs
            > to crawl.

            Yes, though some notes to keep in mind:
            - when adding URIs with the JMX importUris, the scope rules are not
            applied immediately: even URIs that would not pass the existing scope
            rules can be scheduled in the frontier. However, if you've retained the
            default "Preselector" processor, which rechecks URI scope when it is
            released from the frontier, they may be rejected at that point.
            - if you flag the URIs as new seeds, and you've set the scope to be
            derived from the seeds, adding new seeds will alter your current
            seeds.txt and should expand the effective scope.
            (HOWEVER, looking at the code right now, it looks like this might
            be broken in SurtPrefixScope, although it works as intended in the
            SurtPrefixedDecideRule used to effect SURT-based scopes in
            DecidingScope.)
            - You can import URIs using the same format as the recovery log uses,
            with a 'hopsPath' (the string of letters describing how the URI
            was discovered) and 'via' (immediately preceding URI). (This may
            affect further scoping decisions, and is relevant to one of your
            questions below.)

            > My question is, once those new URLs gets downloaded,
            >
            > - How does heritrix decide whether to download links extracted from
            > newly-added URLs?

            Every URI that is discovered by the Extractor processors gets queued up
            inside the originating URI as it continues its processing. The "LinksScoper"
            processor then tests each discovered URI against the configured crawl scope:
            if the scope accepts the URI, it survives to be scheduled into the frontier
            in a future step (the "FrontierScheduler" processor). If the scope rejects
            the URI, it is discarded.

            (As noted above, the "Preselector" also redundantly rechecks scope
            when the URI comes out of the frontier -- in case the scope has changed.
            If your scope never changes, you could leave Preselector out; if your scope
            rarely changes, you could uncheck the 'enabled' option on the Preselector so
            it's normally skipped -- but then if you do change scope, remember to enable
            the rechecking.)
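
            Put as pseudo-Java, the LinksScoper/FrontierScheduler step described
            above boils down to this (an illustrative sketch with stand-in
            interfaces, not the real processor classes):

              import java.util.List;

              public class LinksScoperSketch {
                  interface Scope { boolean accepts(String uri); }    // stands in for the crawl scope
                  interface Frontier { void schedule(String uri); }   // stands in for the BdbFrontier

                  /** What happens per fetched page, in miniature. */
                  static void scopeAndSchedule(List<String> discoveredLinks,
                                               Scope scope, Frontier frontier) {
                      for (String link : discoveredLinks) {
                          if (scope.accepts(link)) {
                              frontier.schedule(link);   // survives to be queued
                          }                              // rejected links are simply dropped
                      }
                  }
              }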

            > - How does max-link-hop and max-trans-hop parameters in the
            > SurtPrefixScope comes into play w.r.t those extracted links?

            The 'hopsPath' chain is the important determinant of how these
            maximums are applied. If there are more 'L' hops in the chain than
            'max-link-hops', a URI is ruled-out, even if it would otherwise
            be in scope.

            In contrast, the 'max-trans-hops' indicates how much of a 'free ride' is
            given to URIs that are 'transcluded' -- meaning referred to as necessary
            inside a purely in-scope page, as with an IMG or FRAME SRC. As long as
            the number of 'transitive hops' ('P'recondition/'R'eferral/'E'mbed/
            'X'speculative-embed) at the *end* of the 'hopsPath' doesn't exceed the
            'max-trans-hops', the URI will be ruled-in, even if it otherwise would
            not be in scope.

            So for example a hopsPath of 'LLXLEE' has *2* trans-hops at the end.
            If you were crawling '*.edu', but a URI with 'LLXLEE' hops-path was
            on a '.com' host, it would still be included if your 'max-trans-hops'
            was 2 or higher, but discarded if your 'max-trans-hops' was 1 or 0.
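
            In code terms the two limits look at different parts of that
            hopsPath string, roughly like this (illustrative only, not the
            actual Heritrix scope classes):

              public class HopsPathSketch {
                  /** Count 'L' (link) hops anywhere in the path; too many rules the URI out. */
                  static boolean exceedsMaxLinkHops(String hopsPath, int maxLinkHops) {
                      int linkHops = 0;
                      for (char c : hopsPath.toCharArray()) {
                          if (c == 'L') linkHops++;
                      }
                      return linkHops > maxLinkHops;
                  }

                  /** Count trailing non-'L' hops (P/R/E/X); within max-trans-hops they ride free. */
                  static int trailingTransHops(String hopsPath) {
                      int n = 0;
                      for (int i = hopsPath.length() - 1; i >= 0 && hopsPath.charAt(i) != 'L'; i--) {
                          n++;
                      }
                      return n;
                  }

                  public static void main(String[] args) {
                      // "LLXLEE": the two 'E's at the end are the trans-hops from the example above.
                      System.out.println(trailingTransHops("LLXLEE"));   // prints 2
                  }
              }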

            Hope this helps clear things up.

            If you are trying new scopes, the other thing to look into is the
            DecidingScope. It 'unwraps' some of the things bundled together in the
            classic scopes to be separate reorderable 'DecideRules', applied in
            sequence. As a result, you can gain even finer control over what's
            included and what isn't.

            - Gordon @ IA




          • joehung302
            Message 5 of 12, Dec 22, 2005
              >
              > Every URI that is discovered by the Extractor processors gets queued up
              > inside the originating URI as it continues its processing.

              How about new URIs discovered through the JMX importUris as non-seeds?
              Let's say I JMX imported this link (http://members.aol.com/joe) as a
              non-seed, and this link gets crawled/extracted and the crawler gets
              two new links:

              http://members.aol.com/joe/kid1.html
              http://members.aol.com/joe/kid2.html

              Since http://members.aol.com/joe is not a seed, would the crawler
              continue to download

              http://members.aol.com/joe/kid1.html
              http://members.aol.com/joe/kid2.html

              > If you are trying new scopes, the other thing to look into is the
              > DecidingScope. It 'unwraps' some of the things bundled together in the
              > classic scopes to be separate reorderable 'DecideRules', applied in
              > sequence. As a result, you can gain even finer control over what's
              > included and what isn't.
              >

              Frankly I'm willing to try anything that would lead me to a 500M-link
              crawl. Right now it seems to me the most promising methods are a
              split-crawl technique with SurtPrefixScope and moving URIs around.
              I'm hoping someone on this list can shed some light...

              cheers,
              -joe
            • Gordon Mohr (archive.org)
              Message 6 of 12, Dec 22, 2005
                joehung302 wrote:
                >> Every URI that is discovered by the Extractor processors gets queued up
                >> inside the originating URI as it continues its processing.
                >
                >
                > How about new URIs discovered through the JMX importUris as non-seed?
                > Let's say I JMX imported this link (http://members.aol.com/joe) as
                > non-seed and this link gets crawled/extracted and the crawler get
                > two new links
                >
                > http://members.aol.com/joe/kid1.html
                > http://members.aol.com/joe/kid2.html
                >
                > Since http://members.aol.com/joe is not seed, would the crawler
                > continue to download
                >
                > http://members.aol.com/joe/kid1.html
                > http://members.aol.com/joe/kid2.html
                >

                Depends on the rest of the scope settings. Would these two URIs
                have been accepted by the scope before the importUris, if they
                had been discovered on a crawled page? The fact that they were
                discovered on a URI that was imported makes no difference -- the
                same rules will be applied at the LinksScoping step.

                >> If you are trying new scopes, the other thing to look into is the
                >> DecidingScope. It 'unwraps' some of the things bundled together in the
                >> classic scopes to be separate reorderable 'DecideRules', applied in
                >> sequence. As a result, you can gain even finer control over what's
                >> included and what isn't.
                >>
                >
                >
                > Frankly I'm willing to try anything that would leads me to a 500M
                > links crawl. Right now it seems to me the most promising methods are
                > split-crawl technique with SurtPrefixScope and moving uris around.
                > I'm hoping someone on this list can shed some light...

                Yes, something using SURTs to split among multiple crawlers is your best
                bet given the current software. You may still want to have all the crawlers
                have the same scope, but each only retain URIs of a portion of that scope,
                using the CrawlMapper processor, as noted in the messages:

                http://groups.yahoo.com/group/archive-crawler/message/2348
                http://groups.yahoo.com/group/archive-crawler/message/2402
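
                The splitting itself can be as simple as hashing the host so
                that every crawler runs the same broad scope but keeps only its
                own share, diverting the rest to the peer responsible for it.
                A made-up sketch of that idea (see the messages linked above
                for how CrawlMapper actually does it):

                  import java.util.zip.CRC32;

                  public class CrawlSplitSketch {
                      /** Map a URI's host to one of N crawler instances. */
                      static int crawlerFor(String host, int crawlerCount) {
                          CRC32 crc = new CRC32();
                          crc.update(host.toLowerCase().getBytes());
                          return (int) (crc.getValue() % crawlerCount);
                      }
                  }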

                - Gordon @ IA
              • joehung302
                  Message 7 of 12, Dec 23, 2005
                  >
                  > Yes, something using SURTs to split among multiple crawlers is your best
                  > bet given the current software. You may still want to have all the crawlers

                  Would the following configuration break .com into two?

                  Machine A's SurtPrefixScope
                  http://(com.a
                  http://(com.b
                  http://(com.c
                  ...
                  http://(com.n

                  Machine B's SurtPrefixScope
                  http://(com.o
                  http://(com.p
                  ...
                  http://(com.z

                  Assuming you guys implemented what you described in the wiki
                  http://crawler.archive.org/cgi-bin/wiki.pl?SurtScope

                  path: http://(org,archive,www,)/movies/
                  host: http://(org,archive,www,)/
                  host-ex: http://(org,archive,www,
                  subsuff: http://(org,archive,www (accepts www1,wwww,www99, etc. in
                  addition to all above)

                  I guess the 'subsuff' is the key...

                  Thanks,
                  -joe