Messages List

8713

Re: Account required to view Heritrix documentation

Hello Lauren, Earlier this week apparently there was some unusual traffic against the wiki, and anonymous access has been temporarily suspended, until we and
Noah Levitt
Jul 3
#8713
 
8712

Account required to view Heritrix documentation

The Heritrix documentation at https://webarchive.jira.com/wiki/display/Heritrix/Heritrix is now requiring account creation and log in to view. Is that
Ko, Lauren
Jul 3
#8712
 
8711

Re: Job Management ... workload balancing

Additional question: 2a. If a job runs on 3 machines, what happens when one of the 3 machines dies? Will the other 2 pick up the rest of the work, or will the job die?
helenchenpoon
Jun 22
#8711
 
8710

Job Management ... workload balancing

I am new to Heritrix. As a matter of fact, I am still in the evaluation stage. Say I have 20 large sites to crawl and 200 small sites. Since the scope and
helenchenpoon
Jun 22
#8710
 
8709

Re: Any ideas on validating crawlertraps before inserting them into

You could use an xml library to modify your cxml. That would take care of encoding things as necessary. Noah On Tue, Jun 9, 2015 at 7:36 AM, Søren Vejrup
Noah Levitt
Jun 10
#8709
 
8708

SV: Any ideas on validating crawlertraps before inserting them into

Hi Kristinn. Thanks for your reply. I don't know if we will follow your suggestion. It's easier in NAS to embed them directly into the template than refer to
Søren Vejrup Carlsen
Jun 10
#8708
 
8707

Re: Any ideas on validating crawlertraps before inserting them into

I got sufficiently sick of this XML encoding of regexes that I made a variant of the MatchesListRegexDecideRule that reads the regexes from a plain text file.
Kristinn Sigurðsson
Jun 10
#8707
 
8706

Any ideas on validating crawlertraps before inserting them into a H3

Hi all. We're currently testing NetarchiveSuite 5.0, and have a test that adds a list of global crawlertraps (i.e. traps inserted into the
Søren Vejrup Carlsen
Jun 9
#8706
 
8705

Re: Focused crawling

Helen, It would be better to roll your own extractors for this. Extend the Extractor class. You can do file IO from there. I have a simple tutorial here:
Shriphani Palakodety
Jun 5
#8705
 
8704

Focused crawling

I am investigating various open source tools, including Heritrix, to do the following. Crawling products from a list of known websites like Amazon. To specify a
helenchenpoon
Jun 3
#8704
 
8703

Re: Does heritrix3 have ways to extract licensing from webpages

Hello Maarten, I wasn't aware of this before but I see that archive.org uses rel="license" to point to the license of some items. For example:
Noah Levitt
May 9
#8703
 
8702

Does heritrix3 have ways to extract licensing from webpages

Hi, Little bit of background: I work at Kennisland, a Dutch think tank that also works for heritage institutions. Kennisland also represents Creative Commons
Maarten Zeinstra
May 9
#8702
 
8701

Does heritrix3 have ways to extract licensing from webpages

Hi, Little bit of background: I work at Kennisland, a Dutch think tank that also works for heritage institutions. Kennisland also represents Creative Commons
Maarten Zeinstra
May 9
#8701
 
8700

Re: FTP Crawling

Hello Markus, FetchFTP doesn't support urls with userinfo. You can configure authentication using the username and password parameters of the fetcher. To apply
Noah Levitt
Apr 20
#8700
 
8699

FTP Crawling

Hello, I have a problem crawling an FTP site. I added the FTP fetch module as described here:
Markus.Mirsberger
Apr 18
#8699
 