1397
Re: Good point to clean out loop dir Hi Pat, ... The only input to DemoCrawlWorkflow.createFlow() is the previous loop's crawlDB. Status is there for the DemoStatusTool to crunch. -- Ken ... Ken
Ken Krugler
Feb 16
#1397
 
1396
Good point to clean out loop dir I have a miner that creates some extra dirs in the loop dir. I need to remove some unneeded dirs because they take a lot of space. So as I understand: 1)
Pat Ferrel
Feb 15
#1396
 
1395
Re: bixo development Hello T, Nutch is definitely more active. But we are maintaining Bixo - and actively using it for projects. And as much as time permits we do respond to any
Vivek Magotra
Jan 13
#1395
 
1394
bixo development Hi, I am currently deciding on whether to use bixo or nutch as a crawler component as part of a larger system. No searching or indexing is required. Bixo looks
433dd47777afdfa8e36d75d8e59f63c0
Jan 13
#1394
 
1393
Re: focused crawl Ah, idiot moi. The seed urls go into loop 0 crawldb. So they would definitely get deferred if not complete when the duration times out. Just thinking out
Pat Ferrel
Nov 26, 2013
#1393
 
1392
Re: focused crawl Thanks Ken, Using efficient crawl policy and a 10 second crawl delay (one reason it's so slow). The trick about using a time to finish is a great idea. That
Pat Ferrel
Nov 25, 2013
#1392
 
1391
Re: focused crawl Hi Pat, ... Normally you don't want to run this type of crawl in complete mode, as otherwise the entire crawl can stall out because a site has a ton of URLs
Ken Krugler
Nov 25, 2013
#1391
 
1390
focused crawl I have a long list of URLs that I'd like to crawl but no further, in other words no offlinks should be followed. My question is, will Bixo try to break up the
Pat Ferrel
Nov 25, 2013
#1390
 
1389
Re: Slow crawl The previous release version, I think 0.8? Might be able to re-run it. Unfortunately I didn't version control the filters and seed urls, which I've since made
Pat Ferrel
Oct 25, 2013
#1389
 
1388
Bixo 0.9.1 released Hi all, Bixo 0.9.1 has just been released, and is available via Conjars or as a distribution download at
Ken Krugler
Oct 25, 2013
#1388
 
1387
Re: Slow crawl Hi Pat, ... I can't think of any reason why things would have significantly slowed down. What was the version of Bixo that you were using previously, versus
Ken Krugler
Oct 25, 2013
#1387
 
1386
[ANNOUNCEMENT] 0.3 release of crawler-commons Hi, Just to let you know that we have just released version 0.3 of crawler-commons. Crawler-commons is a set of reusable Java components that implement
Julien Nioche
Oct 11, 2013
#1386
 
1385
Slow crawl I just upgraded to the latest Bixo and added some new page bits to mine. I sent it to mine things on the same site I was mining from in the past but now the
Pat Ferrel
Oct 10, 2013
#1385
 
1384
Re: Flow warning Hi Pat, It looks like the problem is that the SplitterAssembly constructors aren't calling a new super (SubAssembly) constructor that got added in cascading
Chris Schneider
Sep 24, 2013
#1384
 
1383
Re: web mining I see your point--you probably need some kind of least case validation that a datum has the minimum data before it is added to the sink. Maybe this can be
Pat Ferrel
Sep 24, 2013
#1383
 
1382
Flow warning The Bixo build tests and my running miner get the following warning. I'm not sure this is important since the flow seems to run correctly but I haven't looked
Pat Ferrel
Sep 24, 2013
#1382
 
1381
Re: Maintaining crawl data/state externally from Hadoop ... My experience is that the ability to pin a sparse column into memory and scan it in parallel at extremely high speed is really helpful for this kind of
Ted Dunning
Sep 23, 2013
#1381
 
1380
Maintaining crawl data/state externally from Hadoop Hi Ted, ... Thanks for the ref, I'd forgotten about Percolator. I've looked at using Cassandra to manage crawl state for a continuous crawler, based on Storm.
Ken Krugler
Sep 23, 2013
#1380
 
1379
Re: web mining Interesting paper. We've talked a bit about that here before. Each object being indexed or mined often has an optimum update frequency. Batch crawling makes it
Pat Ferrel
Sep 22, 2013
#1379
 
1378
Re: web mining On Sat, Sep 21, 2013 at 11:25 AM, Ken Krugler ... An increasingly common style is to push content processing (other than link finding) into a separate process
Ted Dunning
Sep 22, 2013
#1378
 
1377
Re: Fetch truncating? Thanks Chris! D'oh, I should have known that! Fixed--the pages are 300K a lot of the time. You are nominated for best support of the year. I owe you one. BTW:
Pat Ferrel
Sep 21, 2013
#1377
 
1376
Re: Fetch truncating? Hi Pat, Please see FetcherPolicy.DEFAULT_MAX_CONTENT_SIZE (64K). This is easily configurable via the FetcherPolicy constructor. FYI, - Chris ... Chris
Chris Schneider
Sep 21, 2013
#1376
 
1375
Fetch truncating? Ah, getting closer. The FetchedDatum has identical text to Curl but FetchedDatum is truncated after line 11313 while Curl returns the total 29363 lines. The
Pat Ferrel
Sep 21, 2013
#1375
 
1374
Re: Xpath Hi Pat, First of all, if you're going to use XPath, then as far as I know you MUST use something like TagSoup, NekoHTML, Tidy, etc. to clean up broken HTML.
Ken Krugler
Sep 21, 2013
#1374
 
1373
Re: web mining Hi Pat, ... This approach has appeal, for sure. Issues I've run into when trying to do this type of refactoring, that you might want to consider... 1. Often
Ken Krugler
Sep 21, 2013
#1373
 
1372
Re: Xpath Hmm, there seem to be several things going on here. First the ParsedDatum is missing some HTML elements, maybe "cleaned?" So I took the FetchedDatum from
Pat Ferrel
Sep 21, 2013
#1372
 
1371
Re: Xpath Hi Pat, This is why SimpleParser uses Tika to parse the web page content and extract text from it. Tika is designed to clean up the arbitrary XML so that it
Chris Schneider
Sep 21, 2013
#1371
 
1370
Xpath Using several online xpath testers I find most fail on any slightly malformed xml. Meaning they return no result if an & is left raw in High & Away
Pat Ferrel
Sep 21, 2013
#1370
 
1369
web mining After a couple iterations of creating new code to mine new pages it seems like there is a pattern here. I wonder if anyone else is using something similar. 1)
Pat Ferrel
Sep 20, 2013
#1369
 
1368
Re: upgrade Since it was working a diff solved the problem. I'd switched around where I connected to the parsePipe accidentally. Still not sure why that matters--I admit
Pat Ferrel
Sep 17, 2013
#1368
 
Loading 1 - 30 of total 1,397 messages