
Messages List

1439

tags not picked up via the content extractor Hi all, I have written a content extractor that was based on the HtmlContentExtractor. However when I parse my content some pages have tags and attributes

al_hendry
Feb 26
#1439
 


1436

Re: CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer Thanks for the quick response Ken. I have added a couple of lines to CreateUrlDatumFromOutlinksFunction. After the payload is set for the UrlDatum

al_hendry
Feb 24
#1436
 


1435

Re: CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer Hi there, I'll try to take a look at it in more detail tomorrow, but this does look like a bug in the logic used by the LatestUrlDatumBuffer - thanks! How

Ken Krugler
Feb 24
#1435
 


1434

CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer Hi there, so I first had a quick look at Bixo a few years back but due to work commitments etc I never got to use it for anything. Recently I have had a

al_hendry
Feb 24
#1434
 


1419

Re: DemoStatusTool.java file path input error [2 Attachments] Hi Perry, ... Each folder inside of the working directory has a name that starts with a loop number (first number is '0') followed by a timestamp. Given the

Ken Krugler
Nov 7, 2014
#1419
 


1418

DemoStatusTool.java file path input error Hi guys, I keep receiving errors after I specify the working directory in the input parameters. Can somebody tell me what's wrong? And is the -workingdir the

perrylims
Nov 6, 2014
#1418
This message has attachments
  • PNG
    88 KB
  • PNG
    118 KB


1417

Re: Scoring system Hi David, I'm not sure what you mean by urls that "don't match the search". Are you using the web mining example? -- Ken

Ken Krugler
Sep 8, 2014
#1417
 


1416

Scoring system Hello everyone, When mining a large collection of websites or the entire web, is it possible to give a score of 0 to all URLs that don't match the search even if they turn

David Marco
Sep 8, 2014
#1416
 


1415

Re: sitemap.xml ... The sitemap parsing support in crawler-commons should support the full sitemap spec. ... These are separate concepts. See FetchBuffer source for how both

Ken Krugler
Jun 18, 2014
#1415
 


1414

Re: sitemap.xml Using a variant of the mining code, getting URLs with XPath from sitemaps would be easy. The full sitemap spec allows includes of compressed files so

Pat Ferrel
Jun 18, 2014
#1414
 


1413

Re: sitemap.xml ... You'd only have a gigantic single loop if you configured the fetching policy to be "complete" versus efficient, and if you didn't have a time limit. -- Ken

Ken Krugler
Jun 17, 2014
#1413
 


1412

Re: sitemap.xml Yeah, it would be nice to parse the sitemap to mine URLs then send them through the same filters like any other page. I'd consider doing this as the next step

Pat Ferrel
Jun 17, 2014
#1412
 


1411

Re: sitemap.xml Hi Pat, The way I'd like to handle site maps is the same as any other file. E.g. when processing robots.txt, extract the site maps and add that as a "found"

Ken Krugler
Jun 17, 2014
#1411
 


1410

sitemap.xml Crawling sites has become problematic in cases where the service has heavier-weight JavaScript + JSON client code. This is especially so for lists. Take my

Pat Ferrel
Jun 17, 2014
#1410
 


1409

Re: ParsePipe in DemoWebMiningTool results in a faulty DOM structure Hey Ken I have attached the two files...one is from the contentBytes field of fetchedDatum and the other one is from the parsedText field of parsedDatum. Just

sridhar_shrey
Jun 12, 2014
#1409
This message has attachments
  • 942 KB
    topchart_parseddatum
  • 821 KB
    topchart_fetcheddatum