
I have a web scraping application written in OO Perl, with a single WWW::Mechanize object used throughout the app. How can I make it not fetch the same URL twice, i.e. make a second get() with the same URL a no-op:

my $mech = WWW::Mechanize->new();
my $url = 'http://google.com';

$mech->get( $url ); # first time, fetch
$mech->get( $url ); # same url, do nothing
– planetp

3 Answers


See WWW::Mechanize::Cached:

Synopsis

use WWW::Mechanize::Cached;

my $cacher = WWW::Mechanize::Cached->new;
$cacher->get( $url );

Description

Uses the Cache::Cache hierarchy to implement a caching Mech. This lets one perform repeated requests without hammering a server impolitely.
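If you want control over where and for how long responses are kept, the constructor also accepts a cache object. A minimal sketch, assuming a Cache::FileCache backend; the namespace and expiry values here are illustrative, not prescribed by the module:

use WWW::Mechanize::Cached;
use Cache::FileCache;

# Illustrative settings; tune the namespace and expiry for your scraper.
my $cache = Cache::FileCache->new( {
    namespace          => 'my-scraper',
    default_expires_in => 600,            # seconds
} );

my $mech = WWW::Mechanize::Cached->new( cache => $cache );

$mech->get( $url );    # first time: fetched over the network
$mech->get( $url );    # second time: served from the cache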

– Sinan Ünür

You can store the URLs and their responses in a hash:

my $mech = WWW::Mechanize->new();
my $url = 'http://google.com';
my %response;

$response{$url} = $mech->get($url) unless $response{$url};
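Wrapped in a small helper, the hash works as a memo table. A minimal sketch (cached_get is a hypothetical helper, not part of WWW::Mechanize); note that the stored values are HTTP::Response objects, while $mech's notion of the current page still reflects the last real fetch:

my %seen;

# Hypothetical helper: fetch a URL only the first time it is requested.
sub cached_get {
    my ( $mech, $url ) = @_;
    $seen{$url} = $mech->get($url) unless exists $seen{$url};
    return $seen{$url};
}

my $res = cached_get( $mech, $url );   # fetches
$res    = cached_get( $mech, $url );   # stored response, no second request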
– rarbox

You can subclass WWW::Mechanize and override the get() method to do what you want:

package MyMech;
use base 'WWW::Mechanize';

sub get {
    my $self = shift;
    my($url) = @_;

    if (defined $self->res && $self->res->request->uri ne $url) {
        return $self->SUPER::get(@_);
    }
    return $self->res;
}
– Eugene Yarmash
  • if get() has not been called $self->res is undefined and this throws 'Can't call method "request" on an undefined value' on the first get. Change the 4th line of sub get to if ( !$self->res || $self->res->request->uri ne $url) { to allow get to be called. – MkV Mar 26 '10 at 07:25
  • This will ignore a second request *in succession* for the same URL. I assumed the OP wanted responses to be cached over any interval. – Borodin Jul 15 '12 at 17:05
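Folding both comments back in, a corrected sketch of the subclass might look like this; per Borodin's caveat, it still only suppresses an immediately repeated request for the same URL, not repeats over a longer interval:

package MyMech;
use base 'WWW::Mechanize';

sub get {
    my $self = shift;
    my ($url) = @_;

    # Guard against the very first call, when res() is still undefined.
    if ( !$self->res || $self->res->request->uri ne $url ) {
        return $self->SUPER::get(@_);
    }
    return $self->res;
}

1;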