Re: [archive-crawler] continuous crawling proposal

  • Tom Emerson
    Message 1 of 24, Feb 2, 2005
      stack writes:
      > On the resemblance of pages, there's been quite a bit written. This
      > seems to be the 'classic':
      > http://citeseer.ist.psu.edu/broder97resemblance.html.

      Indeed, this and the WWW6 paper Broder coauthored are really the
      "classic" pieces on the problem. I have a paper by Sergei Brin and/or
      Larry Page on the method Google used to use (or may still be using, I
      don't know).

      The Broder shingling algorithm is relatively straightforward to
      implement, but it requires that you remove markup and tokenize the
      input. Stripping markup can be tricky, though for this task you can
      usually do so with extreme prejudice. Tokenization can be more difficult
      in some languages, like Chinese or Japanese, where a straight
      implementation of Broder's algorithm doesn't work well.
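      (For anyone who wants to experiment, here is a minimal sketch of
      w-shingling plus min-hash resemblance estimation -- illustrative only,
      and its whitespace tokenization is exactly the naive part that breaks
      on Chinese or Japanese:)

      import java.util.*;

      // Minimal sketch of Broder-style resemblance: w-shingles + min-hash.
      public class Shingler {
          static final int W = 4;        // shingle width in tokens
          static final int SKETCH = 84;  // number of min-hash functions

          // Naive whitespace tokenization -- fine for English, not for CJK.
          static Set<String> shingles(String text) {
              String[] toks = text.toLowerCase().split("\\s+");
              Set<String> out = new HashSet<>();
              for (int i = 0; i + W <= toks.length; i++) {
                  out.add(String.join(" ", Arrays.copyOfRange(toks, i, i + W)));
              }
              return out;
          }

          // Keep the minimum hash per function; two sketches agree at any
          // given position with probability equal to the resemblance.
          static long[] sketch(Set<String> shingles) {
              Random r = new Random(42);  // same seeds for every document
              long[] seeds = new long[SKETCH];
              for (int k = 0; k < SKETCH; k++) seeds[k] = r.nextLong() | 1L;
              long[] mins = new long[SKETCH];
              Arrays.fill(mins, Long.MAX_VALUE);
              for (String s : shingles) {
                  for (int k = 0; k < SKETCH; k++) {
                      long h = (s.hashCode() * seeds[k]) ^ (seeds[k] >>> 17);
                      if (h < mins[k]) mins[k] = h;
                  }
              }
              return mins;
          }

          static double resemblance(long[] a, long[] b) {
              int same = 0;
              for (int k = 0; k < a.length; k++) if (a[k] == b[k]) same++;
              return (double) same / a.length;
          }
      }

      Compare sketch(shingles(a)) against sketch(shingles(b)); anything
      scoring above roughly 0.9 is a near-duplicate candidate.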

      Another problem that many of these papers fail to address is how to
      deal with encoding differences. For example, it is very common to see
      Arabic documents encoded in Latin 1 using HTML character references
      for the Arabic instead of encoding in CP1256 (or 8859-6 or UTF-8) and
      representing the characters directly. In this case you either need to
      first normalize the encodings before generating the shingles (which
      has its own complexity) or just punt. Japanese has two regularly used
      encodings. Russian has at least three... so the "same" document may
      appear in different encodings, and none of the methods will work
      without first normalizing.
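      (The normalization step might look like the following sketch -- the
      charset detection itself is hand-waved here and assumed passed in:)

      import java.nio.charset.Charset;
      import java.util.regex.*;

      // Sketch: decode raw bytes with the declared (or sniffed) charset,
      // then resolve numeric character references like &#1575; or &#x627;
      // so the "same" text always yields the same shingles.
      public class EncodingNormalizer {
          private static final Pattern NCR =
              Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);", Pattern.CASE_INSENSITIVE);

          public static String normalize(byte[] raw, String charsetName) {
              String text = new String(raw, Charset.forName(charsetName));
              Matcher m = NCR.matcher(text);
              StringBuffer out = new StringBuffer();
              while (m.find()) {
                  String g = m.group(1);
                  boolean hex = (g.charAt(0) == 'x' || g.charAt(0) == 'X');
                  int cp = Integer.parseInt(hex ? g.substring(1) : g, hex ? 16 : 10);
                  m.appendReplacement(out,
                      Matcher.quoteReplacement(new String(Character.toChars(cp))));
              }
              m.appendTail(out);
              return out.toString();
          }
      }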

      There was a long thread on duplicate/similar document detection on the
      Corpora-L mailing list a month or two ago. While that discussion
      concentrated on finding duplicates in linguistic corpora, most (if not
      all) of the information there is applicable to this problem.

      -tree

      --
      Tom Emerson Basis Technology Corp.
      Software Architect http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    • John R. Frank
      Message 2 of 24, Feb 6, 2005
        Heritrix,

        What do people think of changing the Postselector to use generic error
        flags that would be set by the Fetch chain? This way, it wouldn't be
        specifically dependent on HTTP status codes per se.

        John
      • Michael Stack
        Message 3 of 24, Feb 7, 2005
          John R. Frank wrote:

          > Heritrix,
          >
          > What do people think of changing the Postselector to use generic error
          > flags that would be set by the Fetch chain? This way, it wouldn't be
          > specifically dependent on HTTP status codes per se.

          This makes sense, especially if the intent is to have fetchers that
          can do other than HTTP.

          The awkward thing would be that fetch status would need to
          accommodate the extent of the HTTP result code vocabulary, including
          redirects and 401-type notions (I suppose ye've looked at mapping
          the new protocol result codes to HTTP and it doesn't work for ye?).

          All fetch status codes are defined in
          org.archive.crawler.datamodel.CrawlURI. Check it out. As is, there is
          already some mixing of two 'protocols' -- DNS and HTTP.

          St.Ack

        • Kristinn Sigurdsson
          Message 4 of 24, Feb 7, 2005
            Probably the only way to do this intelligently would be to have two status flags/strings/codes. One would be the Heritrix generic status code and another a 'per protocol' code that would vary (and sometimes be omitted) based on the protocol in use. Thus processors would be able to access the HTTP status codes if they need them.

            I'd imagine that the Heritrix status code would only have a handful of codes for the entire set of HTTP codes, possibly just one to indicate that the software received a response from the server, without going into details about what that response was. More likely we'd want a few extra to denote retryable connection errors etc.

            In any case, I don't see this as a priority issue until we add FTP (or some other protocol) support (and we need to address a few other issues before that is possible). This does need to be addressed before we can add further protocols, since FTP (at least) uses codes that overlap the HTTP ones.
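            Something like this is what I'm picturing (a hypothetical sketch, not the actual CrawlURI fields):

            // Hypothetical sketch of a two-level fetch status: a small
            // generic vocabulary for the framework, plus the raw
            // per-protocol code for processors that want it.
            public class FetchStatus {
                public enum Generic {
                    GOT_RESPONSE,     // server answered; details in protocol code
                    RETRYABLE_ERROR,  // connect timeout, DNS hiccup, etc.
                    FATAL_ERROR,      // give up on this URI
                    PRECLUDED         // robots.txt, out of scope, etc.
                }

                private final Generic generic;
                private final String protocol;   // "http", "ftp", "dns", ...
                private final int protocolCode;  // e.g. HTTP 404, FTP 550; -1 if none

                public FetchStatus(Generic generic, String protocol, int protocolCode) {
                    this.generic = generic;
                    this.protocol = protocol;
                    this.protocolCode = protocolCode;
                }

                public Generic getGeneric() { return generic; }

                // Processors that care about HTTP specifics can still get them.
                public int getHttpCode() {
                    return "http".equals(protocol) ? protocolCode : -1;
                }
            }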
             
            - Kris

          • Dave Skinner
            Message 5 of 24, Feb 7, 2005
              I normally do broad crawls and use lots of special filters to limit the
              scope so that heritrix does not just run out of memory and crash....

              However, the internal issue that some filters normally return
              true and some false is a problem for me, and looking at the code
              and reading things like the user and developer manuals makes me
              think that I'm not the only person it has bothered.

              In a recent private email to me someone said:

              >Filters are not our proudest moment. ..... was to have replaced the way
              >they work by now. Here is a writeup that was done on an alternative:
              >http://crawler.archive.org/cgi-bin/wiki.pl?NewScopingModel. We just
              >haven't gotten around to it.

              The proposal would be a great thing to have implemented, but it
              looks to me like it has about as much work in it as some master's
              thesis projects, and it's going to be hard to get someone to do it.

              I'd like to propose something that would simplify things in the short term
              and hopefully make any future changes/testing easier.

              Let's change the definition of the result of a filter to simply be

              *true* means to continue *normal processing*.

              (this complements the result of the pre-fetch filters)

              I think everyone likely to read this knows what normal processing
              means, but just to clarify what I think it means, let's say that
              normal processing is that a URI progresses step by step through
              the chain illustrated in section 6.1.3 of the user manual.

              As an extension of this I'd like to say that a filter that has
              *enabled* set to false or is incompletely configured (for example,
              a regexp filter with no filter string) should normally return
              true. (As an example of a current problem related to this, just
              set enabled to false in the pathdepth filter configured in the
              default profile.)
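              In code, the rule I'm proposing looks roughly like this
              (illustrative names, not the actual Heritrix Filter class):

              // Sketch: "true" always means "continue normal processing";
              // a disabled or unconfigured filter simply steps aside.
              public abstract class BaseFilter {
                  public final boolean accepts(Object uri) {
                      if (!isEnabled() || !isFullyConfigured()) {
                          return true;  // step aside: continue normal processing
                      }
                      return innerAccepts(uri);
                  }

                  protected abstract boolean innerAccepts(Object uri);

                  protected boolean isEnabled() { return true; }

                  // e.g. a regexp filter with no filter string set
                  protected boolean isFullyConfigured() { return true; }
              }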

              I was going to suggest this last week and post some sample code,
              but a machine ran out of disk storage while doing a crawl, and
              Linux, instead of just crashing the job, reused the disk space
              that was being occupied by recently referenced files. As a result
              of that problem I had to recover my source and retest everything.

              As part of that I downloaded the current head and implemented
              this (so now I have real code). Right now it is working, but I
              need to review the following.

              Most filters can use the "return true if not enabled or
              configured" idea except some filters used in scopes. They may
              have a special problem which I'm going to look at later today.

              The following formula is in the user manual:

              ( ( focusFilter.accepts(u)
                  || transitiveFilter.accepts(u) )
                && exclusionFilter.accepts(u) == false )

              The accurate version of it is in the developer manual:

              protected final boolean innerAccepts(Object o) {
                  // accept if a seed, in focus, or transitively included,
                  // and not explicitly excluded
                  return ((isSeed(o) || focusAccepts(o)) || additionalFocusAccepts(o)
                      || transitiveAccepts(o)) && !excludeAccepts(o);
              }

              The potential problem is in the focusAccepts and
              additionalFocusAccepts calls.

              btw, in my code I've changed this equation to be

              return ((isSeed(o) || focusAccepts(o)) || additionalFocusAccepts(o)
                  || transitiveAccepts(o)) && excludeAccepts(o);

              I'm not too worried if the focusAccepts have to remain special
              cases because end users are not too likely to be changing them.

              The need for the "OR" filter has vanished.

              here is the diff between the current profile in head and the
              proposed new one:

              7,43c29,38
              < <newObject name="exclude-filter" class="org.archive.crawler.filter.OrFilter">
              <   <boolean name="enabled">true</boolean>
              <   <boolean name="if-matches-return">true</boolean>
              <   <map name="filters">
              <     <newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
              <       <boolean name="enabled">true</boolean>
              <       <integer name="max-path-depth">20</integer>
              <       <boolean name="path-less-or-equal-return">false</boolean>
              <     </newObject>
              <     <newObject name="pathologicalpath" class="org.archive.crawler.filter.PathologicalPathFilter">
              <       <boolean name="enabled">true</boolean>
              <       <integer name="repetitions">3</integer>
              <     </newObject>
              <   </map>
              < </newObject>
              ---
              > <map name="exclude-filters">
              >   <newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
              >     <boolean name="enabled">true</boolean>
              >     <integer name="max-path-depth">20</integer>
              >   </newObject>
              >   <newObject name="pathological" class="org.archive.crawler.filter.PathologicalPathFilter">
              >     <boolean name="enabled">true</boolean>
              >     <integer name="repetitions">3</integer>
              >   </newObject>
              > </map>

              I'd post code or diffs, but there is just too much to post all of
              it, and until there is some discussion about it I would not know
              what to use for examples... and right now some of it looks hacked
              (I did not delete the previous code, I usually just commented it
              out, and it's full of print statements); I need to clean it up.

              I'm serious enough about this that if someone is using special filters and
              does not have a programmer available to change or test them, I'll do it for
              them.


              Dave Skinner dave at solid dot net
              High Performance Programming---assembly (lots of them), C, java, perl
              Database and Non-trivial web site implementations
              Real-time and embedded systems are my specialty
            • Dave Skinner
              Message 6 of 24, Feb 7, 2005
                I'm noticing a phenomenon that I don't like.

                Some of the dot-php sites will return a redirected error page
                full of links when trying to fetch robots.txt.

                How about changing things so that if there is a redirect on
                robots.txt it does not follow the redirect...
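                Roughly what I have in mind (a sketch with made-up names, not
                the actual fetcher code):

                // Sketch: refuse to follow redirects when the thing being
                // fetched is robots.txt; treat it like "no robots.txt".
                boolean shouldFollowRedirect(String requestPath, int statusCode) {
                    boolean isRedirect = statusCode == 301 || statusCode == 302
                        || statusCode == 303 || statusCode == 307;
                    if (isRedirect && "/robots.txt".equals(requestPath)) {
                        // Don't chase a script-generated landing page.
                        return false;
                    }
                    return isRedirect;
                }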

                anyone see any problems?
              • stack
                Message 7 of 24, Feb 9, 2005
                  Dave Skinner wrote:

                  > I'm noticing a phenomenon that I don't like.
                  >
                  > Some of the dot-php sites will return a redirected error page
                  > full of links when trying to fetch robots.txt.
                  >
                  > How about changing things so that if there is a redirect on
                  > robots.txt it does not follow the redirect...
                  >
                  > anyone see any problems?

                  Can you send over sample URLs and why this is causing you
                  distress, Dave? If the landing pages are 404s, we shouldn't be
                  crawling any links in the page (see
                  http://crawler.archive.org/xref/org/archive/crawler/postprocessor/Postselector.html#302).
                  Otherwise, don't we want to crawl the page?
                  Thanks,
                  St.Ack

                • Dave Skinner
                  Message 8 of 24, Feb 9, 2005
                    The basic problem is not really in heritrix. I've mostly
                    seen it on systems being driven by php scripts. Heritrix
                    asks for robots.txt, but robots.txt does not exist, and
                    instead of the server just saying so with a 404 error, it
                    returns a 301 or 302 (sometimes even to another host). The
                    resulting page after the redirect (with a 200 return code)
                    is a nice friendly message from the php script that is full
                    of URLs of things you might want to visit instead of the
                    document you asked for.

                    We can't fix every poorly designed web site out there, so
                    I'm suggesting mods to heritrix instead.

                    I've done a simple workaround (using a normal regexp
                    filter) that filters out any page with a name containing
                    .*404.* . 404.php shows up often enough that it must be an
                    example in a popular book. The problem is that it's not
                    always that obvious.

                    Different issue, but related: maybe another mod we should
                    consider is to not extract links from documents named
                    robots.txt. That would be pretty easy now that all the
                    extractors are looking at that value that says the document
                    has been extracted. But this might be the opposite of what
                    Internet Archive would want. Some of the fake robots.txt
                    files are like site maps.
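                    Sketch of that check (FetchedDoc and its flag are
                    hypothetical stand-ins for CrawlURI and the "already
                    extracted" value):

                    class FetchedDoc {
                        String path;               // e.g. "/robots.txt"
                        boolean linkExtractionDone;
                    }

                    void preExtract(FetchedDoc doc) {
                        if ("/robots.txt".equals(doc.path)) {
                            // Pretend extraction already happened; every
                            // extractor checks this flag first, so no links
                            // get harvested from fake robots pages.
                            doc.linkExtractionDone = true;
                        }
                    }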


                    Dave Skinner dave at solid dot net
                    High Performance Programming---assembly (lots of them), C, java, perl
                    Database and Non-trivial web site implementations
                    Real-time and embedded systems are my specialty
                  • stack
                    Message 9 of 24, Feb 10, 2005
                      Pardon the tardy reply to your considered proposal (it
                      took a while for us all at the archive -- Dan, Igor,
                      Gordon and myself -- to get together so we could pow wow
                      over its content).

                      Yes, scoping and filters are the messiest part of
                      Heritrix and they need fixing [see
                      http://crawler.archive.org/articles/user_manual.html#scopeproblems].

                      Pondering your filter definition change, particularly how
                      it would apply to scoping, we found ourselves starting to
                      rename filters to better clarify their actions -- this'd
                      be a good thing -- and having to reverse their operation
                      in a few cases so they'd fit with your proposal. But we
                      found that the changes quickly started to mount in
                      number. At a certain stage -- considering that any change
                      to filters/scoping will be highly disruptive, requiring
                      reformulation of order files -- we called a halt and
                      decided that we'd rather just go the whole hog and go for
                      the already cited 'NewScopingModel' described in
                      http://crawler.archive.org/cgi-bin/wiki.pl?NewScopingModel.
                      The thinking here is that it's more like a term paper
                      than it is a master's thesis (smile); that we could get
                      it up and running pretty quickly. We've bumped up the
                      priority on the scoping change to be the next feature we
                      implement after we address OOME'ing.

                      In spite of the above, may we take a look at your patch,
                      whatever its state, so we can see the nature of the
                      changes made implementing your proposal? Maybe it's
                      possible to get away with a small number of changes after
                      all. Send it to me privately if you'd like.

                      (Let's consider your suggestion that a filter return
                      'true' if misconfigured or disabled separately, as part
                      of '[ 1103015 ] If filter in main scope disabled heritrix
                      aborts imme...' and '[ 1105025 ] Prefetch filter should
                      skip eval if disabled'.)
                      Yours,
                      St.Ack

                      P.S. I fixed the manuals. Thanks for pointing out the
                      discrepancy.



                    • Dave Skinner
                      Message 10 of 24, Feb 10, 2005
                        >
                        >clarify their actions -- this'd be a good thing -- and having to reverse
                        >their operation in a few cases so they'd fit with your proposal. But we
                        >found that the changes quickly started to mount in number. At a certain
                        >stage -- considering that any change to filters/scoping will be highly
                        >disruptive, requiring reformulation of order files -- we called a halt

                        Yep, been there, done that, etc.

                        >and decided that we'd rather just go the whole hog and go for the
                        >already cited 'NewScopingModel' described in
                        >http://crawler.archive.org/cgi-bin/wiki.pl?NewScopingModel. The
                        >thinking here is that it's more like a term paper than it is a master's
                        >thesis (smile); that we could get it up and running pretty quickly.

                        I hope you are right. In the distant past I did lots of term papers in a
                        few hours. Maybe I'm reading more into the proposal than you really put in
                        there.

                        >We've bumped up the priority on the scoping change to be the next
                        >feature we implement after we address OOME'ing.

                        I'd like that; whatever the filter model is, it should simplify my life.

                        It's strange you mention OOMEs. I woke up this morning
                        thinking about them and mentally put them on my list of
                        things to look at today. Then I discovered that I got
                        one three hours into a 10-12 hour run last night. A
                        512MB heap and it still shut down. When I looked at it,
                        the heritrix page header said it was only using about
                        170 megs of heap. I've not looked for it, but is
                        heritrix forcing garbage collection once in a while? I
                        think that forcing it with System.gc() and
                        System.runFinalization() does a better job than the
                        automatic garbage collection. I'd put together a chunk
                        of pseudocode that should not impact throughput if you
                        like. btw, most (all?) of the OOMEs I'm getting are in
                        one of the special extractors, i.e. js, pdf, doc....
                        Maybe they should just create an alert and keep running
                        instead of stopping. It may be that just the one thread
                        is out of memory.
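                        Something like this is the chunk I have in mind -- a
                        minimum-priority daemon thread, so it should not hurt
                        throughput (the interval is a guess):

                        // Sketch of the periodic-GC idea.
                        Thread housekeeper = new Thread(new Runnable() {
                            public void run() {
                                while (true) {
                                    try {
                                        Thread.sleep(5 * 60 * 1000);  // every five minutes
                                    } catch (InterruptedException e) {
                                        return;
                                    }
                                    System.runFinalization();  // flush pending finalizers
                                    System.gc();               // then suggest a collection
                                }
                            }
                        });
                        housekeeper.setDaemon(true);
                        housekeeper.setPriority(Thread.MIN_PRIORITY);
                        housekeeper.start();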

                        >In spite of the above, may we take a look at your patch, whatever its
                        >state, so we can see the nature of the changes made implementing your
                        >proposal? Maybe it's possible to get away with a small number of changes
                        >after all. Send it to me privately if you'd like.

                        Right now it's patches and they are all over the place.
                        But I'll make some diffs, figure out what really needs
                        to be there for this, tar it up, and mail it to you. I
                        must be close to being done; they have been getting
                        simpler.

                        Do you want diffs or my "working sources"? The sources
                        I'm using for this are based on head from about a week
                        ago.



                        Dave Skinner dave at solid dot net
                        High Performance Programming---assembly (lots of them), C, java, perl
                        Database and Non-trivial web site implementations
                        Real-time and embedded systems are my specialty
                      • Dave Skinner
                        Message 11 of 24, Feb 10, 2005
                          Before switching to heritrix I had used wget since
                          around 1994. Most of the time I had wget automagically
                          rewrite the URLs in the documents so that they would
                          all link to each other and could be served in a simple
                          and "user friendly" manner from any machine they were
                          written on (among other parameters, that's -E -K -k).
                          Originally I thought I would just insert a processor
                          in front of the write processors in heritrix to
                          rewrite the URLs. But right now I don't think that is
                          the best long-term idea. To do it correctly in today's
                          world, not just HTML (and the scripts in it) but PDF,
                          PS, DOC, SWF, etc. files would have to be modified.
                          Next year yet another complex file format will
                          probably drop onto the web.

                          So I'm thinking about writing a simple proxy server
                          that would look at a URL and decide whether to return
                          a document from local storage or to fetch it from the
                          web. On the first pass I'll just look at a directory
                          structure that could be built by MirrorWriter. Once
                          the general idea is proven to work, it could be
                          modified to work with ARC files and some kind of
                          database. I figure on a subsequent pass I'd modify it
                          so that if the local storage has multiple versions of
                          a file the user would be presented with a list of them
                          and s/he could pick the right one. This idea is not
                          perfect. The biggest hole that immediately comes to
                          mind is that anything that is likely to be an embed
                          should probably just default to the most recent
                          version, and that may not be correct. (Sorta like a
                          limited version of the Wayback Machine, but I'm
                          expecting this to be just a few pages long.) Anyway,
                          as I said, it won't be perfect, but it should be
                          usable.
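                          Very roughly, the first pass could be as small as this
                          sketch (using the JDK's built-in HttpServer; the
                          mirror root is a placeholder, and the 502 stands in
                          for the real fetch-from-the-web fallback):

                          import com.sun.net.httpserver.*;
                          import java.io.IOException;
                          import java.net.InetSocketAddress;
                          import java.nio.file.*;

                          // Sketch: answer from a MirrorWriter-style tree
                          // (<host>/<path> on disk), punt on anything else.
                          public class MirrorProxy {
                              static final Path ROOT = Paths.get("/crawls/mirror");  // placeholder

                              public static void main(String[] args) throws IOException {
                                  HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
                                  server.createContext("/", new HttpHandler() {
                                      public void handle(HttpExchange ex) throws IOException {
                                          // assumes an HTTP/1.1 Host header
                                          String host = ex.getRequestHeaders().getFirst("Host");
                                          Path local = ROOT.resolve(host + ex.getRequestURI().getPath());
                                          if (host != null && Files.isRegularFile(local)) {
                                              byte[] body = Files.readAllBytes(local);
                                              ex.sendResponseHeaders(200, body.length);
                                              ex.getResponseBody().write(body);
                                          } else {
                                              ex.sendResponseHeaders(502, -1);  // TODO: fetch from the web
                                          }
                                          ex.close();
                                      }
                                  });
                                  server.start();
                              }
                          }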

                          If I were doing this just for myself I'd probably
                          write it in perl. However, if there is general
                          interest in it, I could use C or java. I assume that
                          anyone running heritrix has to have a current java
                          environment in place but may not have a current perl
                          installed.

                          Any comments?


                          Dave Skinner dave at solid dot net
                          High Performance Programming---assembly (lots of them), C, java, perl
                          Database and Non-trivial web site implementations
                          Real-time and embedded systems are my specialty
                        • stack
                          Message 12 of 24, Feb 10, 2005
                            Dave Skinner wrote:

                            > So I'm thinking about writing a simple proxy server that would look at
                            > a URL and decide whether to return a document from local storage or to
                            > fetch it from the web. On the first pass I'll just look at a directory
                            > structure that could be built by MirrorWriter. Once the general idea is
                            > proven to work, it could be modified to work with ARC files and some
                            > kind of database. I figure on a subsequent pass I'd modify it so that
                            > if the local storage has multiple versions of a file the user would be
                            > presented with a list of them and s/he could pick the right one. This
                            > idea is not perfect. The biggest hole that immediately comes to mind is
                            > that anything that is likely to be an embed should probably just
                            > default to the most recent version, and that may not be correct. (Sorta
                            > like a limited version of the Wayback Machine, but I'm expecting this
                            > to be just a few pages long.) Anyway, as I said, it won't be perfect,
                            > but it should be usable.
                            >
                            A proxy server that sat atop a filesystem of
                            MirrorWriter-made directories and files would be a
                            wonderful addition. Others have done proxy servers
                            to sit atop collections of ARCs. The most mature is
                            probably the one done by our Danish brothers, the
                            'ProxyViewer' application (see
                            http://www.netarchive.dk/website/sources/index-en.htm).
                            Another, less mature, offering is ARC Server
                            (http://archive-access.sourceforge.net/projects/arc-collection-proxy/).

                            St.Ack
