I am using LWP::UserAgent to scrape some pages. To save bandwidth, I don't want to download images and other media on the page; I am only interested in the text.

I can't find anything in the documentation that helps with this. Please help.

GrSrv
  • I use `LWP::UserAgent` all the time and it never downloads images, only the text. It also does not download any AJAX content. – shawnhcorey Dec 10 '16 at 15:30

1 Answer

While you don't show any code, I assume that your scraper follows only the <a href= links and not any <img src= or similar links (e.g. video, CSS, favicon), which obviously point to images and other kinds of data you are not interested in.

Unfortunately, with an <a href= link it is impossible to know up front what kind of data it points to. You can make a guess based on a typical suffix of the resource (e.g. image.png), but you cannot be sure what it really is. You only get this information once you access the resource, for example by checking the Content-Type declared in the response header. LWP offers a way to inspect the response header before downloading the full resource: add a handler for the response_header phase. From the documentation:

response_header => sub { my($response, $ua, $h) = @_; ... }
This handler is called right after the response headers have been received, but before any content data. The handler might set up handlers for data and might croak to abort the request.

This can be used to stop receiving the response for any non-text content:

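use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Runs after the response headers arrive but before any body data is read.
$ua->add_handler( response_header => sub {
    my $resp = shift;
    # Dying here aborts the request, so the body is never downloaded.
    die "no text" if $resp->content_type !~ m{^text/};
});
my $resp = $ua->get('http://example.com/some-image.gif');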
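
One thing to watch for after the call: when a handler dies, LWP does not throw. It returns the response with the die message in an X-Died header and a Client-Aborted header set, and the status code can still be the server's 200. The following is a minimal sketch of how the caller might tell an aborted response from a real one; the helper names text_only_agent, was_aborted and fetch_text are made up for illustration and are not part of LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: builds an agent that aborts any non-text response
# as soon as its headers have been received.
sub text_only_agent {
    my $ua = LWP::UserAgent->new;
    $ua->add_handler(response_header => sub {
        my ($resp) = @_;
        die "no text\n" if $resp->content_type !~ m{^text/};
    });
    return $ua;
}

# Hypothetical helper: true if a handler died and LWP aborted the request.
sub was_aborted {
    my ($resp) = @_;
    return defined $resp->header('Client-Aborted') ? 1 : 0;
}

# Hypothetical helper: returns the decoded text, or nothing if the
# resource was skipped or the request failed.
sub fetch_text {
    my ($ua, $url) = @_;
    my $resp = $ua->get($url);
    return if was_aborted($resp);        # check first: status may still be 200
    return unless $resp->is_success;
    return $resp->decoded_content;
}
```

With these in place, something like my $text = fetch_text(text_only_agent(), $url); would give the page text for text resources and undef for everything else.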
Steffen Ullrich