
Re: [bixo-dev] Right way to tap into a pipe

  • Ken Krugler
    Message 1 of 16, Feb 9, 2012

      On Feb 9, 2012, at 12:06am, Michele Costabile wrote:


      On Feb 9, 2012, at 6:05am, Ken Krugler wrote:

      1. You're assuming that the web page content can be converted to a string, using the default platform encoding. That's what happens with:

      String content = new String(datum.getContentBytes());

      This will work for some web pages (maybe 60% on average) if your platform encoding is UTF-8.

      But Tika does a lot of work to try to handle the other 40%.
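      As a rough sketch of doing that detection explicitly (ICU4J's CharsetDetector, which Tika bundles a variant of, is one way; using it directly here is my own illustration, not what Bixo itself does):

          import java.nio.charset.Charset;

          import com.ibm.icu.text.CharsetDetector;
          import com.ibm.icu.text.CharsetMatch;

          public class ContentDecoder {

              // Decode fetched bytes using a detected charset instead of the platform default.
              public static String decode(byte[] contentBytes) {
                  CharsetDetector detector = new CharsetDetector();
                  detector.setText(contentBytes);
                  CharsetMatch match = detector.detect();

                  if ((match != null) && (match.getConfidence() > 50)) {
                      try {
                          return new String(contentBytes, Charset.forName(match.getName()));
                      } catch (Exception e) {
                          // Unknown or unsupported charset name - fall through to UTF-8.
                      }
                  }

                  // Fall back to an explicit UTF-8 decode, never to the platform default.
                  return new String(contentBytes, Charset.forName("UTF-8"));
              }
          }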

      That's a fair point. Incidentally, it is true that parsing one page I am working on takes about one second (it is 420k), but the advantage of the parser I am using is that I get the DOM and can query it with CSS3 selectors, which come in very handy for getting at the content I need.
      How can I at least get access to the DOM with a Tika parser?

      This is something Vivek has done in the past, so I'm hoping he'll respond.

      -- Ken

      --------------------------
      Ken Krugler
      custom big data solutions & training
      Hadoop, Cascading, Mahout & Solr




    • Michele Costabile
      Message 2 of 16, Feb 16, 2012
        Hi Ken, I have done a lot of homework and delved into BaseContentExtractor, BaseLinkExtractor and what is built on top of them.
        I found that it is very easy to create a custom SAX parser and plug it into the workflow, for example by changing SimpleCrawlWorkflow like this:

        // Take content and split it into content output plus parse to extract URLs.
        - SimpleParser parser = new SimpleParser();
        + SimpleParser parser = new SimpleParser(new MyOwnContentExtractor(), new SimpleLinkExtractor(), new ParserPolicy(), null);

        MyOwnContentExtractor can parse the incoming HTML and do its job easily. The output will be some structured data: in one project I am considering it will be contact data for professionals listed in showcase sites, in another it will be names, brands and prices for products on display in e-commerce sites.
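        For example, the SAX side of such an extractor could look roughly like this (the class name, the element and the CSS class are placeholders, and the real thing has to extend Bixo's BaseContentExtractor rather than a plain DefaultHandler):

            import java.util.ArrayList;
            import java.util.List;

            import org.xml.sax.Attributes;
            import org.xml.sax.helpers.DefaultHandler;

            // Placeholder handler: collects the text of every <span class="product-name">.
            public class ProductNameHandler extends DefaultHandler {

                private final List<String> _productNames = new ArrayList<String>();
                private StringBuilder _current = null;

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    if ("span".equalsIgnoreCase(localName) && "product-name".equals(attrs.getValue("class"))) {
                        _current = new StringBuilder();
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (_current != null) {
                        _current.append(ch, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ((_current != null) && "span".equalsIgnoreCase(localName)) {
                        _productNames.add(_current.toString().trim());
                        _current = null;
                    }
                }

                public List<String> getProductNames() {
                    return _productNames;
                }
            }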
        The problem is that getting that data back out is not obvious.
        I could treat it like outlinks, but that means stuffing my extracted products or addresses into ParsedDatum and changing all the dependencies, which couples my code tightly to the general mechanism of the parser.
        In a previous message, Vivek suggested creating a side pipe and a function for parsing the HTML and producing the structured data. This is easy and clean, requiring only four lines in SimpleCrawlWorkflow:

        + Path productsDirPath = new Path(curWorkingDirPath, CrawlConfig.PRODUCTS_SUBDIR_NAME);
        + Tap productsSink = new Hfs(new SequenceFile(ProductDatum.FIELDS), productsDirPath.toString());

        ...

        + Pipe productsPipe = new Pipe("products pipe", parsePipe.getTailPipe());
        + productsPipe = new Each(productsPipe, new CreateProductDatumsFunction());
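        To actually get the extra output written, the new tail and sink have to be handed to the FlowConnector together with the ones the workflow already sets up. A rough sketch in Cascading 1.x style (the source and sink maps here are stand-ins for whatever SimpleCrawlWorkflow builds; the one real requirement is that the sink map key matches the tail pipe's name):

            import java.util.HashMap;
            import java.util.Map;

            import cascading.flow.Flow;
            import cascading.flow.FlowConnector;
            import cascading.pipe.Pipe;
            import cascading.tap.Tap;

            public class SidePipeWiring {

                // Register the products pipe and its sink alongside the workflow's existing taps.
                public static Flow connectWithProducts(Map<String, Tap> sourceMap,
                                                       Map<String, Tap> sinkMap,
                                                       Pipe parseTailPipe,
                                                       Pipe productsPipe,
                                                       Tap productsSink) {
                    Map<String, Tap> sinks = new HashMap<String, Tap>(sinkMap);
                    sinks.put(productsPipe.getName(), productsSink);

                    FlowConnector flowConnector = new FlowConnector();
                    return flowConnector.connect(sourceMap, sinks, parseTailPipe, productsPipe);
                }
            }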

        I have been able to tap both fetchPipe and parsePipe; in this case I chose parsePipe.
        In CreateProductDatumsFunction I need to get at the products. I copied the mechanism that passes around outlinks, creating:

        + public Product[] getProducts() {
        + return convertTupleToProducts((Tuple)_tupleEntry.get(PRODUCTS_FN));
        + }
        +
        + public void setProducts(Product[] products) {
        + _tupleEntry.set(PRODUCTS_FN, convertProductsToTuple(products));
        + }
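        The function on the side pipe then just unpacks those products and emits one tuple per product. A rough sketch (ProductDatum and Product are my own classes, getProducts() is the accessor added above, and I am assuming ParsedDatum exposes getUrl() and can be rebuilt from the incoming TupleEntry the way other Bixo datums can):

            import bixo.datum.ParsedDatum;

            import cascading.flow.FlowProcess;
            import cascading.operation.BaseOperation;
            import cascading.operation.Function;
            import cascading.operation.FunctionCall;
            import cascading.tuple.TupleEntry;

            @SuppressWarnings({ "serial", "rawtypes" })
            public class CreateProductDatumsFunction extends BaseOperation implements Function {

                public CreateProductDatumsFunction() {
                    // Declare the fields this function emits.
                    super(ProductDatum.FIELDS);
                }

                @Override
                public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
                    TupleEntry arguments = functionCall.getArguments();

                    // Assumption: ParsedDatum can be reconstructed from the incoming tuple entry.
                    ParsedDatum datum = new ParsedDatum(arguments);

                    // Emit one ProductDatum tuple per extracted product.
                    for (Product product : datum.getProducts()) {
                        ProductDatum result = new ProductDatum(datum.getUrl(), product);
                        functionCall.getOutputCollector().add(result.getTuple());
                    }
                }
            }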

        All in all, it is starting to work, but it seems to me too involved and too coupled to the original source code.
        I think this is a general problem whenever you create data on the fly instead of writing all the crawled content to disk and examining it later, so I guess it has already been solved, as usually happens :-)

      • Ken Krugler
        Message 3 of 16, Feb 17, 2012
          Hi Michele,

          If I understand what you're trying to do, then I think the approach I would take is…

          1. Add my own ContentExtractor, which just writes out the cleaned up HTML that it gets back from Tika.

          I think Vivek has an example of how to do this, so that you get all of the HTML elements (and attributes), not just the ones that Tika thinks are important.

          2. Then take the output of the ParsePipe and run it through a custom function that uses Dom4J to build the Document object, and XPath to process it.

          The above is the cleanest approach, but it means you're parsing the page twice - once with TagSoup, to clean it up, and once with Dom4J.

          There's a one-step approach that's more complex, where you replace ParsePipe with your own version and feed the SAX events from Tika/TagSoup into Dom4J to create the Document model.
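          A rough sketch of step 2, assuming the cleaned-up XHTML arrives as a string (the XPath expression is just an example, and Dom4J's selectNodes() needs Jaxen on the classpath):

              import java.io.StringReader;
              import java.util.List;

              import org.dom4j.Document;
              import org.dom4j.DocumentException;
              import org.dom4j.Node;
              import org.dom4j.io.SAXReader;

              public class CleanedHtmlAnalyzer {

                  // Parse the cleaned-up XHTML and pull out the nodes you care about via XPath.
                  @SuppressWarnings("unchecked")
                  public static List<Node> findPrices(String cleanedXhtml) throws DocumentException {
                      SAXReader reader = new SAXReader();
                      Document doc = reader.read(new StringReader(cleanedXhtml));

                      // Example query: every element whose class attribute is "price".
                      return doc.selectNodes("//*[@class='price']");
                  }
              }

          For the one-step variant, Dom4J's SAXContentHandler can act as the ContentHandler that receives the Tika/TagSoup SAX events, and its getDocument() then gives you the tree without a second parse.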

          -- Ken





          --------------------------
          Ken Krugler
          custom big data solutions & training
          Hadoop, Cascading, Mahout & Solr




        • Michele Costabile
          Message 4 of 16, Feb 20, 2012
            > 1. Add my own ContentExtractor, which just writes out the cleaned up HTML that it gets back from Tika.
            >
            > I think Vivek has an example of how to do this, so that you get all of the HTML (and attributes), not just the ones that Tika thinks are important.
            >
            > 2. Then take the output of the ParsePipe and run it through a custom function that uses Dom4J to build the Document object, and XPath to process it.
            >
            > The above is the cleanest approach, but it means you're parsing the page twice - once with TagSoup, to clean it up, and once with Dom4J.

            Thank you Ken. I think this is the simplest approach and it is the one I tried in the first place. What I would most appreciate is a working sample that takes the cleaned-up HTML from the fetch pipe and feeds it through a side pipe, into which I can plug a function that does some analysis and creates my own tuples.
            This is what I expected from a tagline such as "A web mining toolkit" in the first place, but I would say we are not there yet, at least not with the sources as currently shared.
            I would be willing to help the project, but there is so much to understand that it is not practical to keep pushing unless the system can be carved into subsets so I can concentrate on a single part for starters. More guidance would be needed to succeed. I am totally open to sharing the parts of my work that would be of general use.

            The main obstacle I have found so far is that -- as in the case of outlinks -- there should be a way to generate data in the parse pipe and use it in another one.
            One other problem is that Bixo is curl on steroids and can download lots of data that might not be relevant, for example if I am just trying to collect email addresses from artists' sites.

            > There's a one step approach that's more complex, where you replace ParsePipe with your own version, and you feed the SAX events from Tika/TagSoup to Dom4J to create the Document model.

            This would be a more refined approach; I volunteer for the coding, but I cannot be of much help on the architecture.
          • Ken Krugler
            Message 5 of 16, Feb 20, 2012
              Hi Michele,

              Thanks for sharing your input - often it's hard for me to see where the challenges lie, since I've been using this framework for 2+ years.

              I also appreciate your offer to help make it more accessible. I'm on vacation right now, but will follow up with you by the end of next week.

              Regards,

              -- Ken






              --------------------------
              Ken Krugler
              custom big data solutions & training
              Hadoop, Cascading, Mahout & Solr




            • Michele Costabile
              Message 6 of 16, Feb 20, 2012
                Thank you Ken. Glad to hear you will get back to this when you return.
                I will be on vacation too over the weekend, but I have deadlines approaching and from Monday I have to focus on results.
                I am in the process of creating a custom ContentExtractor, modifying ParsedDatum to add an array of Products, modifying TikaCallable to collect it, and updating all the constructors and calls along the way. I have no previous experience with Cascading or Hadoop.
                In the meantime, I am afraid I will revert Bixo to the repository head and do some grepping in the 'parse' directories to show early results to stakeholders :-).

                > Thanks for sharing your input - often it's hard for me to see where the challenges lie, since I've been using this framework for 2+ years.
                >
                > I also appreciate your offer to help make it more accessible. I'm on vacation right now, but will follow up with you by the end of next week.