
Re: [PBML] Re: Accessing internet pages using LWP::UserAgent

  • Charles K. Clarkson
    Message 1 of 2, Mar 6, 2002
      "Rahul Jain" <rahul_jain@...> rquested:

      [ reply merged into message ]

      : : From: b_harnish [mailto:bharnish@...]
      : :
      : : > Now, I want this script to retrieve the webpages recursively ie
      : : > it should not only fetch the parent webpage but also all other
      : : > webpages linked to the parent webpage.
      : : >
      : : > Can anyone give some idea about how this can be done?
      : : >
      : : > I use Perl on Win2K.
      : :
      : : Charles talked a bit on this subject:
      : : http://groups.yahoo.com/group/perl-beginner/message/7527
      : : He suggested taking a look at HTML::TokeParser.
      : :
      : : Basically, you want to do this:
      : : get_first_page;
      : : put_links_into_array;
      : : loop_through_array {
      : : get_next_page;
      : : append_links_into_array;
      : : }

      : Yes. This is exactly what I am looking for. But how do
      : I get the list of links? Do I need to parse the whole webpage
      : and leech all the links or is there a simpler way to do this?

      I'm not certain how you would get the links without going
      through the entire web page. The HTML::LinkExtor module
      seems the best choice. Unfortunately, Gisle hasn't updated it
      since 1998. I changed the example a little and came up with
      this.
      While I was testing, I dumped the HTML::LinkExtor object
      and didn't really understand it all. The extracted links
      could be filtered by the kind of URI object each one is
      (a small sketch of that follows the code below). It might
      be better to just subclass HTML::LinkExtor and create a
      new interface. Unfortunately, I don't have time to look
      at that now. (Oh, and I tested this from Win98.)

      use strict;
      use warnings;

      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI::URL;

      my $url = "http://www.aireo.com/index.htm"; # for instance
      my $ua = LWP::UserAgent->new;

      # Set up a callback that collects the links found
      # in <a href="..."> tags.
      my @links;
      sub callback {
          my ($tag, %attr) = @_;

          # we only look closer at <a href=". . .">
          return if $tag ne 'a';
          push @links, $attr{href} if defined $attr{href};
      }

      # Make the parser.
      # Unfortunately, we don't know the base yet
      # (it might be different from $url).
      my $p = HTML::LinkExtor->new( \&callback );

      # Request the document and parse it as it arrives.
      my $res = $ua->request( HTTP::Request->new( GET => $url ),
                              sub { $p->parse( $_[0] ) } );

      # Expand relative links to absolute URLs.
      my $base = $res->base;
      @links = map { url( $_, $base )->abs } @links;

      # Print them out.
      print "$_\n" for @links;

      __END__
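
      On the note above about filtering by the kind of URI
      object, here is a minimal follow-up sketch. It reuses the
      @links array of absolute URI objects built by the map
      above, and the http-only rule is just one example of a
      possible filter; drop it in place of the print loop.

      # Keep only plain http links, dropping mailto:, ftp:,
      # javascript:, and so on.
      my @http_links = grep { $_->scheme and $_->scheme eq 'http' } @links;
      print "$_\n" for @http_links;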


      You'll probably want to use a hash instead of an array for
      the links, so you don't visit a page twice. You might also
      look at LWP::RobotUA. The webcrawler link is out of date.
      Spiders do pretty much the same thing as you're describing.
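
      Here's a rough, untested sketch of that crawl loop using
      LWP::RobotUA and a %seen hash. The start URL, agent name,
      contact address, one-second delay, and the same-host check
      are just assumptions for the example.

      use strict;
      use warnings;

      use LWP::RobotUA;
      use HTTP::Request;
      use HTML::LinkExtor;
      use URI::URL;

      # Placeholders - change these for a real crawl.
      my $start = "http://www.aireo.com/index.htm";
      my $ua    = LWP::RobotUA->new( 'PBML-spider/0.1', 'you@example.com' );
      $ua->delay( 1/60 );   # delay is in minutes; this is one second

      my %seen;             # the hash keeps us from visiting a page twice
      my @queue = ( $start );

      while ( my $url = shift @queue ) {
          next if $seen{$url}++;

          # Collect the <a href="..."> links on this page.
          my @found;
          my $p = HTML::LinkExtor->new( sub {
              my ($tag, %attr) = @_;
              push @found, $attr{href} if $tag eq 'a' and defined $attr{href};
          } );

          # LWP::RobotUA honors robots.txt and sleeps between requests.
          my $res = $ua->request( HTTP::Request->new( GET => $url ),
                                  sub { $p->parse( $_[0] ) } );
          next unless $res->is_success;

          # Make the links absolute and queue the ones on the same host.
          my $base = $res->base;
          for ( map { url( $_, $base )->abs } @found ) {
              next unless $_->scheme and $_->scheme eq 'http';
              push @queue, "$_" if $_->host eq url( $start )->host;
          }
      }

      # Every page we actually visited.
      print "$_\n" for sort keys %seen;

      __END__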

      HTH,
      Charles K. Clarkson
      --
      Clarkson Energy Homes, Inc.
      CJ Web Works - Domains for Real Estate Investors.

      I almost had a psychic girlfriend but she left me before we met.
      - Steven Wright