
Re: [archive-crawler] about embedding heritrix..

  • Mert Çalışkan
    Message 1 of 6 , Dec 28, 2006
      Hey Michael,
       
      - If we had the chance, it would be nice to have a WAR file that depends on the released heritrix.jar served on SourceForge.
       
      - The problem is with the manifest file in the JAR file. Because of the resources declared in the manifest file, Catalina tries to validate them
      and errors occur. Here is an excerpt from the log file.


      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "je" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "commons-httpclient-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "commons-lang-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "commons-logging-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "commons-net-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "commons-codec" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "dnsjava" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "jetty" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "servlet" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "jasper-runtime" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "jasper-compiler" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "poi" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][ heritrix-1.10.1.jar]: Required extension "poi-scratchpad" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][ heritrix-1.10.1.jar]: Required extension "javaswf" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar ]: Required extension "itext" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "ant-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "junit-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "commons-collections-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "commons-cli" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "mg4j-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "fastutil-local" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "libidn" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp][heritrix-1.10.1.jar]: Required extension "beanshell" not found.
      Dec 28, 2006 12:11:39 PM org.apache.catalina.util.ExtensionValidator validateManifestResources
      INFO: ExtensionValidator[/HeritrixWebapp]: Failure to find 23 required extension(s).
      Dec 28, 2006 12:11:39 PM org.apache.catalina.core.StandardContext start
      SEVERE: Error getConfigured
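
      The "Required extension ... not found" lines suggest the JAR manifest carries an Extension-List attribute that the container tries to resolve. A minimal Java sketch for checking what a manifest declares is below. The class name, method names, and the two-entry sample manifest are hypothetical, and the sample is not the real Heritrix manifest.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestExtensions {

    /** Returns the names listed in the manifest's Extension-List attribute (empty if absent). */
    public static List<String> extensionNames(Manifest manifest) {
        List<String> names = new ArrayList<>();
        String list = manifest.getMainAttributes().getValue(Attributes.Name.EXTENSION_LIST);
        if (list != null) {
            // Extension-List is a whitespace-separated list of extension keys.
            for (String name : list.trim().split("\\s+")) {
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    /** Parses manifest text held in memory (a stand-in for META-INF/MANIFEST.MF inside a jar). */
    public static Manifest parse(String text) {
        try {
            return new Manifest(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample, shortened to two entries for illustration.
        String sample =
            "Manifest-Version: 1.0\r\n" +
            "Extension-List: je dnsjava\r\n" +
            "je-Extension-Name: je\r\n" +
            "dnsjava-Extension-Name: dnsjava\r\n" +
            "\r\n";
        System.out.println(extensionNames(parse(sample)));
    }
}
```

      For a real JAR on disk, the Manifest object could instead come from new java.util.jar.JarFile(path).getManifest(), which returns null when the JAR has no manifest.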
       
      - I've checked out the wiki page you pointed to. While trying to instantiate Heritrix with the constructor argument true, I get this exception:

      javax.management.InstanceAlreadyExistsException: org.archive.crawler:jmxport=8849,name=Heritrix,type=CrawlService,guiport=8080,host=HAL

      It is thrown while trying to register in Heritrix.java, at server.registerMBean(objToRegister, objName);

      But I haven't registered it within another standalone or embedded app. Any clues?
        

      Regards,
       
       
       
      On 12/28/06, Michael Stack <stack@...> wrote:

      Mert Çalışkan wrote:

      Hi,
       
      I've downloaded heritrix.war from the latest cruisecontrol builds to look at what's inside.
      Actually there is no heritrix.jar under WEB-INF/lib (there are class files for heritrix, commons-httpclient, commons-pool).
      What is the reason for this?






      None other than the WAR generation is based on maven (1.0.2) WAR goal and this is how it does the assembly.



       
      And when I tried to embed Heritrix inside my webapp, the MANIFEST.MF caused some trouble (on Tomcat).
      The dependencies declared there cause the container to throw severe errors.




      You mean the Heritrix JAR or WAR manifest?  Both just list info about dependencies (but, yeah, in the past I've come across strange issues w/ MANIFEST.MF content formats).  What kind of complaints are you seeing?  We don't use the WAR format around here much so you'll have to help us out if there are issues.


       
      And are there any further steps for embedding Heritrix inside a webapp other than implementing a context listener?




      Check out: http://crawler.archive.org/cgi-bin/wiki.pl?EmbeddingHeritrix

      Yours,
      St.Ack


       
      Thanks in advance...
       
      Mert..


    • Michael Stack
      Message 2 of 6 , Dec 28, 2006
        Mert Çalışkan wrote:
        Hey Michael,
         
        - If we had the chance, it would be nice to have a WAR file that depends on the released heritrix.jar served on SourceForge.





        OK.  Build is about to be refactored.  We're moving off maven 1.x and on to 2.x (and from CVS to SVN).  Will consider it then (Meantime made an issue: http://sourceforge.net/tracker/index.php?func=detail&aid=1623770&group_id=73833&atid=539102).


         
        - The problem is with the manifest file in the JAR file. Because of the resources declared in the manifest file, Catalina tries to validate them
        and errors occur. Here is an excerpt from the log file.


         




        I just committed the following change to project.properties:

        -maven.jar.manifest.extensions.add = true
        +maven.jar.manifest.extensions.add = false

        Try builds post build.42 (See under the 'build artifacts' link here: http://builds.archive.org:8080/cruisecontrol/buildresults/HEAD-heritrix).



        - I've checked out the wiki page you pointed to. While trying to instantiate Heritrix with the constructor argument true, I get this exception:

        javax.management.InstanceAlreadyExistsException: org.archive.crawler:jmxport=8849,name=Heritrix,type=CrawlService,guiport=8080,host=HAL

        It is thrown while trying to register in Heritrix.java, at server.registerMBean(objToRegister, objName);

        But I haven't registered it within another standalone or embedded app. Any clues?
          










        I just tried the Heritrix WAR and seems to come up fine so the above seems a little odd.  Heritrix logs its registration w/ the JMX Agent as follows:

        Dec 28, 2006 8:58:07 AM org.archive.crawler.Heritrix postRegister
        INFO: org.archive.crawler:guiport=8080,host=debord,name=Heritrix,type=CrawlService registered to MBeanServerId=debord_1167325084008, SpecificationVersion=1.2 Maintenance Release, ImplementationVersion=1.5.0_08-b03, SpecificationVendor=Sun Microsystems

        Anything in the log ahead of your Heritrix construction?  (No chance you've left the heritrix.war under your tomcat webapps directory and its registration is clashing w/ your attempted construction?)  Can you do a listing on your tomcat JMX Agent to see extant beans to see if already a Heritrix bean present?  What happens if you try the constructor that takes a name  (plus boolean)?

        St.Ack
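
        Listing the beans already registered under the org.archive.crawler domain can be sketched in Java as follows. This assumes the bean lives on the JVM's platform MBeanServer; if it was registered with a different server, MBeanServerFactory.findMBeanServer(null) returns the candidates to query instead. The class name, the Placeholder interface, and the demo registration are hypothetical and exist only so the example runs without Heritrix on the classpath.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class ListCrawlerBeans {

    /** Hypothetical management interface, only here so the demo has something to register. */
    public interface Placeholder {
        String getStatus();
    }

    /** Canonical names of every MBean registered under the given JMX domain. */
    public static Set<String> namesInDomain(MBeanServer server, String domain) {
        try {
            Set<String> names = new TreeSet<>();
            // "domain:*" matches any key properties within that domain.
            for (ObjectName n : server.queryNames(new ObjectName(domain + ":*"), null)) {
                names.add(n.getCanonicalName());
            }
            return names;
        } catch (JMException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Registers a stand-in bean, lists the domain, then unregisters the stand-in again. */
    public static Set<String> demo() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name = new ObjectName("org.archive.crawler:name=Heritrix,type=CrawlService");
            Placeholder impl = new Placeholder() {
                public String getStatus() {
                    return "idle";
                }
            };
            server.registerMBean(new StandardMBean(impl, Placeholder.class), name);
            try {
                return namesInDomain(server, "org.archive.crawler");
            } finally {
                server.unregisterMBean(name);
            }
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

        Inside a webapp, calling namesInDomain just before constructing Heritrix would show whether a bean with the clashing name is already present.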






      • Michael Stack
        Message 3 of 6 , Dec 28, 2006
          I should warn that Heritrix HEAD may be a little unstable at the moment (A possible file handle leak. TBD).
          Yours,
          St.Ack






      • Mert Çalışkan
        Message 4 of 6 , Dec 28, 2006
          Hey Michael,
           
          I got it working with the Heritrix(String, boolean) constructor, and I've seen the info messages that you mentioned.
          (I am using embedded Tomcat within my Eclipse, so there are no other apps deployed.)
           
          With the boolean-only constructor I get the exception below:

          java.lang.RuntimeException: javax.management.InstanceAlreadyExistsException: org.archive.crawler:jmxport=8849,name=Heritrix,type=CrawlService,guiport=8080,host=HAL
              at org.archive.crawler.Heritrix.<init>(Heritrix.java:450)
              at org.archive.crawler.Heritrix.<init>(Heritrix.java:387)
              at org.archive.crawler.Heritrix.<init>(Heritrix.java:375)
              at com.ontometrics.dev.listener.HeritrixContextListener.contextInitialized(HeritrixContextListener.java:24)
              at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3763)
              at org.apache.catalina.core.StandardContext.start(StandardContext.java:4211)
              at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013)
              at org.apache.catalina.core.StandardHost.start(StandardHost.java:718)
              at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013)
              at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442)
              at org.apache.catalina.core.StandardService.start(StandardService.java:450)
              at org.apache.catalina.core.StandardServer.start(StandardServer.java:709)
              at org.apache.catalina.startup.Catalina.start(Catalina.java:551)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
              at java.lang.reflect.Method.invoke(Unknown Source)
              at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294)
              at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432)


          Now it's time to create some jobs for crawling... time to dive into the javadocs :)
          Regards,
           
          Mert..
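
          The thread does not establish why the default name was already taken. The JMX behaviour behind the exception can still be reproduced in isolation: registering two beans under the same ObjectName fails, while a different name key succeeds. The Java sketch below is only an illustration under that assumption. The Crawler interface, the class name, and the "MyCrawler" name are hypothetical, and it uses a private MBeanServer rather than anything Heritrix creates.

```java
import java.util.Arrays;
import javax.management.InstanceAlreadyExistsException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class DuplicateRegistration {

    /** Hypothetical management interface standing in for the real crawler bean. */
    public interface Crawler {
        String getLabel();
    }

    /** Tries to register a stand-in bean; returns false if the ObjectName is already taken. */
    public static boolean tryRegister(MBeanServer server, final String objectName) {
        try {
            Crawler impl = new Crawler() {
                public String getLabel() {
                    return objectName;
                }
            };
            server.registerMBean(new StandardMBean(impl, Crawler.class), new ObjectName(objectName));
            return true;
        } catch (InstanceAlreadyExistsException e) {
            // Same condition as in the stack trace above: the name is already registered.
            return false;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Registers the default name twice, then a distinct name, on a private server. */
    public static boolean[] demo() {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        String base = "org.archive.crawler:type=CrawlService,name=";
        return new boolean[] {
            tryRegister(server, base + "Heritrix"),   // first registration of the name
            tryRegister(server, base + "Heritrix"),   // same name again
            tryRegister(server, base + "MyCrawler")   // distinct name
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [true, false, true]
    }
}
```

          Passing an explicit name to Heritrix(String, boolean) presumably changes the name key of the ObjectName in the same way, which would explain why that constructor avoided the clash here.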
           


           
