Is it possible to abort an HTTP request depending on the `Content-Type` when using Perl's LWP?

Question

I want my script to download only text/html content, not binaries or images that could take significantly longer to download. I know about the max_size parameter, but I would like to add a check on the Content-Type header. Is this doable?

score 6 · Accepted Answer · answered Jul 30 '12 at 15:27

As pointed out by others, you can perform a HEAD request before your GET request. This is also a way of being polite to the server: it is easy for you to abort the connection, but not necessarily easy for the web server to stop sending a bunch of data and doing a bunch of work on its end.

There are some different ways to do this depending on how sophisticated you want to be.

You can send an Accept header with your request that lists only text/html. A well-implemented HTTP server will return a 406 Not Acceptable status if the resource is not available in a type you say you accept. Of course, the server might send it to you anyway. You can send the same header with your HEAD request as well.
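As a rough sketch of the Accept approach, the following builds a request that advertises only text/html and sorts the possible outcomes. The helper names `html_only_request` and `classify_response` are made up for illustration and are not part of LWP; the sketch assumes HTTP::Request and HTTP::Response (both installed alongside LWP) are available.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Hypothetical helper: build a GET or HEAD request that accepts text/html only.
sub html_only_request {
    my ($method, $url) = @_;
    return HTTP::Request->new($method => $url, [ Accept => 'text/html' ]);
}

# Hypothetical helper: sort a response to such a request into one of four cases.
sub classify_response {
    my ($response) = @_;
    return 'not_acceptable' if $response->code == 406;
    return 'error'          unless $response->is_success;
    # A server may ignore Accept, so the Content-Type still has to be checked.
    # content_type() drops parameters such as "; charset=utf-8".
    return $response->content_type eq 'text/html' ? 'html' : 'other';
}
```

The request object can be passed to `$ua->request(...)`, and the resulting response to `classify_response`.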

When using a recent version of LWP::UserAgent, you can use a handler subroutine to abort the rest of the request after the headers and before the content body.

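use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Runs once the headers have arrived but before any body is read.
# Dying here makes LWP abort the rest of the transfer.
$ua->add_handler( response_header => sub {
    my($response, $ua, $h) = @_;

    die "Not HTML" unless $response->content_type eq 'text/html';
});

my $url = "http://example.com/foo";

my $html;
my $head_response = $ua->head($url, Accept => "text/html");
# LWP traps the handler's die and reports it in the X-Died header,
# so an aborted response can still carry a 2xx status.
if ($head_response->is_success && !defined $head_response->header('X-Died')) {
    my $get_response = $ua->get($url, Accept => "text/html");
    if ($get_response->is_success && !defined $get_response->header('X-Died')) {
        $html = $get_response->content;
    }
}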

See the Handlers section of the LWP::UserAgent documentation for details on handlers.

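Note that, as far as I can tell, LWP traps the `die` raised in the handler rather than propagating it: the aborted response carries the message in an `X-Died` header and may still have a 2xx status, so check for that header rather than wrapping the call in an exception handler. I haven't dealt with 406 responses carefully here. I leave that as an exercise for the reader.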
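One way to tell an aborted response from a complete one is to look for the `X-Died` header that LWP adds when a handler or callback dies. The sketch below assumes that behaviour; `was_aborted` and `body_if_complete` are hypothetical helper names, not LWP functions.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: true if a handler or callback died during the request.
sub was_aborted {
    my ($response) = @_;
    return defined $response->header('X-Died') ? 1 : 0;
}

# Hypothetical helper: the body of a complete, successful response,
# or undef (in scalar context) if the request failed or was aborted.
sub body_if_complete {
    my ($response) = @_;
    return unless $response->is_success;
    return if was_aborted($response);
    return $response->content;
}
```

With the handler from the answer installed, `my $html = body_if_complete($ua->get($url, Accept => "text/html"));` would leave `$html` undefined for anything that is not text/html.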

I believe some servers might send you the content of the file on a `HEAD` request as well. You probably also need to check for `405 Method Not Allowed` responses on the `HEAD` request and see if you can submit the `GET` request anyway if the `HEAD` fails with that response. — zostay, Jul 30 '12 at 15:31
`HEAD` is not reliable, and it seems to me that anything that "properly" handles a `HEAD` request will also "properly" handle a `Accept` header. I would skip the HEAD request and use only the other two mechanisms (`Accept` and callback). — ikegami, Jul 30 '12 at 15:35
The add_handler() thing does exactly what I was looking for. Thanks! — mbonnin, Jul 30 '12 at 16:43

score 1 · Answer 2 · answered Jul 30 '12 at 15:01

You can use a HEAD request to query the URI's header info. If the server responds to HEAD requests, you'll get every header that a GET would have returned, without that pesky body.

You can then decide what to do based on the MIME type.

Otherwise, you'll have to rely on the file's extension before you request it.
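A minimal sketch of that decision, taking the HEAD response as input: `action_after_head` is a hypothetical name, and the 405 fallback follows the suggestion in the comments on the accepted answer rather than anything LWP does on its own.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: decide whether to follow a HEAD request with a GET.
sub action_after_head {
    my ($head_response) = @_;
    # The server refuses HEAD: try the GET anyway.
    return 'get'  if $head_response->code == 405;
    return 'skip' unless $head_response->is_success;
    # content_type() drops parameters such as "; charset=utf-8".
    return $head_response->content_type eq 'text/html' ? 'get' : 'skip';
}
```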

Len Jaffe

3,442
1
21
28

This will indeed work, but it is a bit suboptimal as it doubles the number of requests in the nominal case. I get very few binary links, so I would prefer to abort those rather than add a test for each link. – mbonnin Jul 30 '12 at 15:20
So why can't you determine file type from file extension before issuing the GET? – Len Jaffe Jul 30 '12 at 16:09
well... I have some cases where there isn't a single extension. Especially cgi scripts that send binary data... So this is not really reliable. – mbonnin Jul 30 '12 at 16:14
You can't tell from the tag whence the URI came? – Len Jaffe Jul 30 '12 at 16:30
What tag? What I'm doing is a bot that examines the contents of the web pages it encounters. I have no control over what type of links it might come across. – mbonnin Jul 30 '12 at 16:42
You can typically discern images from links (`<img>` vs `<a>`). – Len Jaffe Jul 30 '12 at 16:55
Yeah, but the links are out of context, so I don't have any `<img>` or `<a>`. Anyway, the add_handler() thing is working fine :-) – mbonnin Jul 30 '12 at 16:57

score 0 · Answer 3 · answered Jul 30 '12 at 15:07

If you are using LWP::Simple, the minimal procedural interface to LWP, then its head function returns the content type as the first element of a list when called in list context.

So you can write

use strict;
use warnings;

use LWP::Simple;

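for my $url ('http://www.bbc.co.uk') {
  # head() returns an empty list on failure, so $ctype may be undefined.
  my ($ctype) = head $url;
  my $content;
  # The header may carry parameters, e.g. "text/html; charset=utf-8".
  $content = get $url if defined $ctype and $ctype =~ m{^text/html\b}i;
}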
