
Integrate Lucene

  • sanjoy
    Message 1 of 5 , Nov 1, 2009
      Hi Ken,

      This is what I am thinking. I want to get Lucene indexing done before I attempt Solr.

      1) I am using Lucene 2.9.0. Nothing against 2.4.1, which seems to be the version in Maven, but I couldn't download the sources for Lucene 2.4.1; the tar and zip archives seem to be broken on the Lucene site.

      2) I have added a couple of lines to SiteCrawler.java.
      Tap luceneSink = new Hfs(new SequenceFile(FetchedDatum.FIELDS.append(MetaData.FIELDS)), curCrawlDirName + "/lucene");

      LucenePipe lucenePipe = new LucenePipe(fetchPipe.getContentTailPipe(), MetaData.FIELDS);

      sinkMap.put(LucenePipe.LUCENE_PIPE_NAME, luceneSink);

      Flow flow = flowConnector.connect(inputSource, sinkMap, fetchPipe, urlPipe, lucenePipe);


      3) Adding a class named bixo.pipes.LucenePipe modeled on ParsePipe.

      4) Adding a class named bixo.operations.LuceneFunction modeled on ParseFunction. The writing of the Lucene index will happen in the operate() function.

      I have these done. I will borrow the code to write the index from existing test code.
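      For readers following along, the essence of what a LuceneFunction ultimately feeds is an inverted index: a map from each term to the documents whose parsed text contains it. Here is a toy, pure-JDK sketch of that idea (illustrative only; the class and method names are invented and this is not Bixo's or Lucene's actual API):

      ```java
      import java.util.*;

      // Toy inverted index: maps each lowercased term to the IDs of the
      // documents containing it. A stand-in for what Lucene's IndexWriter
      // builds internally when given the "ParsedDatum-parsedText" field.
      class ToyIndex {
          private final Map<String, Set<Integer>> postings = new HashMap<>();

          // Analogous to indexing one parsed-text value per document.
          void add(int docId, String parsedText) {
              for (String term : parsedText.toLowerCase().split("\\W+")) {
                  if (term.isEmpty()) continue;
                  postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
              }
          }

          // Returns the IDs of documents whose text contained the term.
          Set<Integer> search(String term) {
              return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
          }

          public static void main(String[] args) {
              ToyIndex idx = new ToyIndex();
              idx.add(1, "Lucene indexing with Cascading");
              idx.add(2, "Cascading pipes and taps");
              System.out.println(idx.search("cascading")); // → [1, 2]
          }
      }
      ```

      The real pipeline delegates all of this to Lucene, of course; the toy just shows why the function only needs to emit the parsed text downstream.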

      Let me know what you think,
      Sanjoy
    • Ken Krugler
      Message 2 of 5 , Nov 3, 2009
        Hi Sanjoy,

        Sorry for the delay in responding, I've been busy with the ACM data mining unconference (prep, and writeups - see http://bixolabs.com/blog/xxx).

        Re how to integrate indexing into SimpleCrawlTool - I would create a separate SimpleIndexTool, which can be run on the output directories of SimpleCrawlTool.

        You could clone the SimpleStatusTool, since that's very similar, but with the change (as per a previous email) of processing the data using Cascading versus directly opening taps & iterating.

        I just wrote some utility code to help create a Cascading source tap that's the set of crawl output subdirs you'd want to process for something like building the Lucene index.

        Re how to output the Lucene index - since this is an output format, conceptually I think the right place is in a Cascading Scheme, which already exists as IndexScheme in Bixo.

        But I think I'd need to see your new code to better understand what looks different when handling this as a LuceneFunction.

        Thanks,

        -- Ken


        --------------------------------------------
        Ken Krugler
        +1 530-210-6378
        e l a s t i c   w e b   m i n i n g




      • sanjoy
        Message 3 of 5 , Nov 3, 2009
          Hi Ken,

          Could you pass me the utility code you have?

          For indexing the output of the crawler, my only concern is that this will double the storage requirement, since we would be storing all the content and also indexing it. That's why I was indexing on the fly. It should also be faster, since the stored content is not reread.

          If you still think we should index the stored content, I will code for that.

          Here's what I have for LuceneFunction:

          public class LuceneFunction extends BaseOperation<NullContext> implements Function<NullContext> {

              private Fields _metaDataFields;

              public LuceneFunction(Fields metaDataFields) {
                  super(new Fields("ParsedDatum-parsedText"));
                  _metaDataFields = metaDataFields;
              }

              @Override
              public void operate(FlowProcess flowProcess, FunctionCall<NullContext> funcCall) {
                  TupleEntry entry = funcCall.getArguments();
                  TupleEntryCollector collector = funcCall.getOutputCollector();

                  String value = entry.getString("ParsedDatum-parsedText");
                  TupleEntry boost = new TupleEntry(new Fields("ParsedDatum-parsedText"), new Tuple(value));
                  collector.add(boost);
              }
          }

          Thanks,
          Sanjoy



        • Ken Krugler
          Message 4 of 5 , Nov 4, 2009
            Hi Sanjoy,

            In the code below, it looks like you're creating a new tuple that has just the parsed text.

            If so, then how does this tie into the process of generating a Lucene index?

            Thanks,

            -- Ken






          • sanjoy
            Message 5 of 5 , Nov 5, 2009
              Yup, that's right.

              It then does

               Fields indexFields = new Fields("ParsedDatum-parsedText");
               Store[] storeSettings = new Store[] { Store.YES };
               Index[] indexSettings = new Index[] { Index.ANALYZED };
               indexScheme = new IndexScheme(indexFields, storeSettings, indexSettings, false, StandardAnalyzer.class, MaxFieldLength.UNLIMITED.getLimit());

              in LucenePipe.java. It basically pipes the field into IndexScheme.
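              For anyone unfamiliar with those flags: Store.YES keeps the raw field value retrievable from the index, while Index.ANALYZED runs the text through the analyzer before indexing, so queries match individual normalized terms rather than the exact original string. A rough pure-JDK illustration of that stored-vs-analyzed distinction (toy code with invented names, not Lucene's implementation):

              ```java
              import java.util.*;

              // Toy field mirroring Lucene's Store.YES + Index.ANALYZED combination:
              // the raw value is kept for retrieval, while a lowercased, tokenized
              // form is what gets matched against query terms.
              class ToyField {
                  final String storedValue;        // Store.YES: raw text kept as-is
                  final List<String> indexedTerms; // Index.ANALYZED: analyzer output

                  ToyField(String value) {
                      this.storedValue = value;
                      List<String> terms = new ArrayList<>();
                      for (String t : value.toLowerCase().split("\\W+")) {
                          if (!t.isEmpty()) terms.add(t);
                      }
                      this.indexedTerms = terms;
                  }

                  // A query term matches only against the analyzed tokens.
                  boolean matches(String queryTerm) {
                      return indexedTerms.contains(queryTerm.toLowerCase());
                  }
              }
              ```

              So with the settings above, a document's parsed text is both searchable term-by-term and recoverable in full from the index.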

              This IndexScheme is defined as a Sink in SiteCrawler.java.

               LucenePipe lucenePipe = new LucenePipe(parsePipe, MetaData.FIELDS);
               Tap luceneSink = new Lfs(lucenePipe.getIndexScheme(), curCrawlDirName + "/lucene", SinkMode.REPLACE);

               and I add this sink to the sinkMap in SiteCrawler.java. That's the connection.

              Please let me know if this is not the right way,
              Sanjoy


