
Messages List

1446

Re: tags not picked up via the content extractor

Thanks for the reply Ken, I will have a look at the DOMParser. Cheers, Alex
al_hendry
Mar 24
#1446
 
1445

Re: CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer

Hi Ken, I have nothing else I want to add. Thanks, Alex
al_hendry
Mar 24
#1445
 
1444

Re: CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer

Hi Alex, Vivek is working on a new release of Bixo, which will hopefully include this fix (and an update to Hadoop 2.4, Cascading 2.6, etc) Is there any other
Ken Krugler
Mar 23
#1444
 
1443

Re: tags not picked up via the content extractor

Hi al_hendry, The issue is that HtmlContentExtractor is being used by SimpleParser, which in turn uses Tika to do the high-level parse of the HTML. And Tika in
Ken Krugler
Mar 23
#1443
 
1442

Re: Anyone using Bixo on Hadoop 2.x?

I'm starting up 2.6 but the distros are on 2.4 I think. I'll have a non-trivial crawl to run if you need testing. On Mar 7, 2015, at 9:56 PM, Ken Krugler
Pat Ferrel
Mar 8
#1442
 
1441

Re: Anyone using Bixo on Hadoop 2.x?

Funny you should ask, I was just gearing up to change dependencies to 2.2 (or 2.4...any thoughts on which one?) and give it a try. So yes, I'd also be
Ken Krugler
Mar 7
#1441
 
1440

Anyone using Bixo on Hadoop 2.x?

Does Bixo run on Hadoop 2.x either as a binary or compiled for it?
Pat Ferrel
Mar 7
#1440
 
1439

tags not picked up via the content extractor

Hi all, I have written a content extractor that was based on the HtmlContentExtractor. However when I parse my content some pages have tags and attributes
al_hendry
Feb 26
#1439
 
1436

Re: CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer

Thanks for the quick response Ken. I have added a couple of lines to CreateUrlDatumFromOutlinksFunction. After the payload is set for the UrlDatum
al_hendry
Feb 24
#1436
 
1435

Re: CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer

Hi there, I'll try to take a look at it in more detail tomorrow, but this does look like a bug in the logic used by the LatestUrlDatumBuffer - thanks! How
Ken Krugler
Feb 24
#1435
 
1434

CreateUrlDatumFromOutlinksFunction and LatestUrlDatumBuffer

Hi there, so I first had a quick look at Bixo a few years back but due to work commitments etc I never got to use it for anything. Recently I have had a
al_hendry
Feb 24
#1434
 
1419

Re: DemoStatusTool.java file path input error [2 Attachments]

Hi Perry, ... Each folder inside of the working directory has a name that starts with a loop number (first number is '0') followed by a timestamp. Given the
Ken Krugler
Nov 7, 2014
#1419
 
1418

DemoStatusTool.java file path input error

Hi guys, I keep receiving errors after I specify the working directory in the input parameters. Can somebody tell me what's wrong? And is the -workingdir the
perrylims
Nov 6, 2014
#1418
This message has attachments
  • PNG
    88 KB
  • PNG
    118 KB
1417

Re: Scoring system

Hi David, I'm not sure what you mean by urls that "don't match the search". Are you using the web mining example? -- Ken ... Ken Krugler +1 530-210-6378
Ken Krugler
Sep 8, 2014
#1417
 
1416

Scoring system

Hello everyone, When mining a large collection of websites or the entire web is it possible to score 0 all urls that don't match the search even if they turn
David Marco
Sep 8, 2014
#1416
 