Bixo - New user - Help Required in installing and configuring

  • Natarajan
    Message 1 of 6 , Feb 25, 2011
      I am highly impressed with this product. I downloaded and installed it on my system. I am able to run the Simple Crawler as documented in the Getting Started section. But after that I am unable to find out how to use this software for web data mining. What are the inputs? What are the outputs?
      How can we change the configuration to get the desired output? I am unable to find answers to these kinds of questions. Can you please point me in the right direction to use this product?

      Thanks

      R.Natarajan
    • Vivek Magotra
      Message 2 of 6 , Feb 26, 2011
        Hi Natarajan,

        Thanks :-)

        Bixo is a toolkit and not a product so there are no predefined inputs and outputs.
        The general usage pattern involves writing Cascading workflows (so you definitely want to read up on that -- http://www.cascading.org/). 
        You could use parts of Bixo (like the fetcher or parser) without Cascading as well.

        The SimpleCrawlTool is one example of using Bixo - you could clone it and make changes to that as a starting point for your own workflow.

        Good luck!

        --vivek



      • Ken Krugler
        Message 3 of 6 , Feb 27, 2011
          Hi Natarajan,


          Something else you should look at is the "helpful" example in Bixo. This shows how to use Bixo to extract links to mailing list archives from a page on Apache, then process emails from the mailing list using Cascading workflows.

          -- Ken

          --------------------------
          Ken Krugler
          +1 530-210-6378
          e l a s t i c   w e b   m i n i n g





        • Natarajan
          Message 4 of 6 , Feb 27, 2011
            Thanks Ken and Vivek for your replies.

            Now I am able to run Bixo from Eclipse and crawl cnn.com.

            I have generated API documentation using Javadoc and Doxygen (via the
            Eclox plugin in Eclipse).

            These documents may be useful for understanding the code.

            I am giving a summary of my understanding for the benefit of other
            newcomers to Bixo.

            Please correct me if I am wrong.

            Hadoop is the base component in Bixo.

            It is used for storing crawled, parsed web pages.

            Cascading is used to interface with Hadoop.

            Hadoop can be used on Linux systems only.

            But we can test Bixo on a Windows system if Cygwin or MSYS is
            installed.

            Bixo contains the code for fetching web pages, extracting the outlinks
            in the fetched HTML pages, and parsing the downloaded pages.

            They are stored in Hadoop file format, in the crawldb, content, and
            parse folders.

            They can be examined or viewed with Hadoop file access methods or tools.

            There is a free Eclipse plug-in for working with Hadoop files at
            http://www.karmasphere.com/Products-Information/karmasphere-studio-community-edition.html.

            I am able to examine all Hadoop files created by Bixo except the
            crawldb.

            Now I am trying to examine the crawldb with Karmasphere, some Java
            code, or another tool.

            In short, Bixo downloads web pages, extracts the text from them, and
            stores the results in the Hadoop file system.

            We can change the flow or enhance it for our specific requirements by
            changing the Java code.

            This developer forum (bixo-dev@yahoogroups.com) contains a lot of useful
            information for developers.


            Thanks

            R.Natarajan

          • Ken Krugler
            Message 5 of 6 , Feb 28, 2011

              On Feb 27, 2011, at 11:51pm, Natarajan wrote:


              > Hadoop is the base component in Bixo.

              Yes.

              > It is used for storing crawled, parsed web pages.

              It is used to run the crawling & parsing workflows. The Hadoop file system (HDFS) can be used to save intermediate and final results.

              > Cascading is used to interface with Hadoop.

              Cascading is used to define the workflows, and translates these into a series of Hadoop jobs.
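
A minimal sketch of that idea (declare the workflow first, let a planner split it into jobs, then run them), written in plain Python. It does not use the Cascading or Hadoop APIs: `Step`, `each`, `group_count`, `plan_jobs`, and `run_jobs` are hypothetical names, and the rule that each job holds at most one grouping step is a simplifying assumption about how such a planner might behave.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urlparse


@dataclass
class Step:
    name: str
    kind: str                      # "each" = record at a time, "group" = needs all records
    run: Callable[[list], list]


def each(name, fn):
    """Apply fn to every record; a None result drops the record."""
    def run(records):
        return [out for out in map(fn, records) if out is not None]
    return Step(name, "each", run)


def group_count(name, key_fn):
    """Count records per key; returns sorted (key, count) pairs."""
    def run(records):
        return sorted(Counter(key_fn(r) for r in records).items())
    return Step(name, "group", run)


def plan_jobs(steps: List[Step]) -> List[List[Step]]:
    """Split declared steps into jobs with at most one grouping step each."""
    jobs: List[List[Step]] = [[]]
    for step in steps:
        if step.kind == "group" and any(s.kind == "group" for s in jobs[-1]):
            jobs.append([])        # a second grouping needs a new job
        jobs[-1].append(step)
    return jobs


def run_jobs(jobs, records):
    """Run the jobs in order, feeding each job's output to the next."""
    for job in jobs:
        for step in job:
            records = step.run(records)
    return records


# Declare the workflow first; nothing runs until run_jobs() is called.
workflow = [
    each("extract_host", lambda url: urlparse(url).netloc or None),
    group_count("count_per_host", lambda host: host),
    each("keep_repeated", lambda pair: pair if pair[1] >= 2 else None),
    group_count("count_by_frequency", lambda pair: pair[1]),
]
```

Calling `plan_jobs(workflow)` here yields two jobs, because the second grouping step cannot share a job with the first; `run_jobs` then executes them in sequence on a list of URLs.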

              > Hadoop can be used on Linux systems only.

              Essentially true.

              > But we can test Bixo on a Windows system if Cygwin or MSYS is
              > installed.

              Yes, re Cygwin. Don't know about MSYS.

              > Bixo contains the code for fetching web pages, extracting the outlinks
              > in the fetched HTML pages, and parsing the downloaded pages.

              Bixo provides Cascading components to handle fetching web pages (including processing robots.txt), parsing pages (via Tika), extracting outlinks, etc.
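
To make two of those steps concrete, here is a small stdlib-only Python sketch of a robots.txt check and of outlink extraction. It is an illustration of the concepts, not Bixo's components: Bixo parses pages via Tika, whereas this sketch uses Python's `html.parser`, and the names `allowed_by_robots`, `extract_outlinks`, and `OutlinkParser` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


class OutlinkParser(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            return                      # skip mailto:, javascript:, etc.
        if absolute not in self.outlinks:
            self.outlinks.append(absolute)


def extract_outlinks(base_url, html):
    """Return the de-duplicated outlinks of an HTML page, in document order."""
    parser = OutlinkParser(base_url)
    parser.feed(html)
    return parser.outlinks
```

A crawler would typically call `allowed_by_robots` before fetching a URL and `extract_outlinks` on each fetched page to find new URLs to visit.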


              > They are stored in Hadoop file format, in the crawldb, content, and
              > parse folders.

              They can be stored using Hadoop sequence files, though any format supported via Cascading taps (text, Avro, SQL database, Amazon SimpleDB, Lucene index, etc) will work.
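
The point that the storage format is pluggable can be sketched as follows. This is plain Python, not the Cascading tap API; `TsvSink`, `JsonLinesSink`, and `save_results` are hypothetical names chosen for the illustration, and the two text formats stand in for whatever formats a real workflow would write.

```python
import io
import json


class TsvSink:
    """Writes the chosen fields of each record as one tab-separated line."""

    def __init__(self, fields):
        self.fields = fields

    def write(self, records, out):
        for record in records:
            out.write("\t".join(str(record[f]) for f in self.fields) + "\n")


class JsonLinesSink:
    """Writes each record as one JSON object per line."""

    def write(self, records, out):
        for record in records:
            out.write(json.dumps(record, sort_keys=True) + "\n")


def save_results(records, sink, out):
    """The workflow only knows the sink interface, not the output format."""
    sink.write(records, out)


if __name__ == "__main__":
    results = [{"url": "http://example.com/", "status": 200}]
    for sink in (TsvSink(["url", "status"]), JsonLinesSink()):
        buffer = io.StringIO()
        save_results(results, sink, buffer)
        print(buffer.getvalue(), end="")
```

Swapping the sink changes where and how the results are stored without touching the rest of the workflow, which is the property the reply above describes.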

              > They can be examined or viewed with Hadoop file access methods or tools.
              >
              > There is a free Eclipse plug-in for working with Hadoop files at
              > http://www.karmasphere.com/Products-Information/karmasphere-studio-community-edition.html.
              >
              > I am able to examine all Hadoop files created by Bixo except the
              > crawldb.
              >
              > Now I am trying to examine the crawldb with Karmasphere, some Java
              > code, or another tool.
              >
              > In short, Bixo downloads web pages, extracts the text from them, and
              > stores the results in the Hadoop file system.
              >
              > We can change the flow or enhance it for our specific requirements by
              > changing the Java code.

              Generally this is all true.

              Though the key point is that Bixo is a toolkit for building web crawling & mining workflows. Much of what you describe above is how the sample workflow (the SimpleCrawlTool) operates, but that's just a quick example/starting point.
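
For readers who want a picture of the kind of loop described in this thread (fetch, parse, extract outlinks, update a crawl DB), here is a rough stdlib-only Python sketch. It is not the SimpleCrawlTool code: the `crawl` function, the status strings, and the regex-based parsing are assumptions made for the illustration, and `fetch` is passed in so the sketch can run without network access.

```python
import re
from urllib.parse import urljoin

HREF = re.compile(r'href="([^"]+)"')   # naive link matcher, for illustration only
TAG = re.compile(r"<[^>]+>")           # naive tag stripper, for illustration only


def crawl(seed_urls, fetch, loops):
    """Run up to `loops` fetch/parse rounds; fetch(url) returns HTML or None."""
    crawldb = {url: "UNFETCHED" for url in seed_urls}   # url -> status
    content = {}                                        # url -> raw HTML
    parsed = {}                                         # url -> extracted text
    for _ in range(loops):
        todo = [url for url, status in crawldb.items() if status == "UNFETCHED"]
        if not todo:
            break
        for url in todo:
            html = fetch(url)
            if html is None:
                crawldb[url] = "ERROR"
                continue
            crawldb[url] = "FETCHED"
            content[url] = html
            parsed[url] = " ".join(TAG.sub(" ", html).split())
            # Newly discovered outlinks are queued for the next round.
            for href in HREF.findall(html):
                crawldb.setdefault(urljoin(url, href), "UNFETCHED")
    return crawldb, content, parsed
```

Each round only fetches URLs that were already known when the round started, so the number of rounds bounds how far the crawl moves away from the seed URLs.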

              Regards,

              -- Ken







            • Natarajan
              Message 6 of 6 , Mar 1, 2011
                Thanks, Ken, for correcting the mistakes in my comments.

                All this information is very useful for developing an application using
                the Bixo toolkit for web data mining.

                Now I have started exploring it gradually.

                Thanking you once again,

                R.Natarajan

