I want my script to download only text/html content, not binaries or images that could take significantly longer to download. I know about the max_size parameter, but I would like to add a check on the Content-Type header. Is this doable?
3 Answers
As pointed out by others, you can perform a `HEAD` request before your `GET` request. This is also a polite thing to do: aborting the connection is easy for you, but it is not necessarily easy for the web server to stop sending a bunch of data and doing a bunch of work on its end.
There are some different ways to do this depending on how sophisticated you want to be.
- You can send an `Accept` header with your request which only lists `text/html`. A well-implemented HTTP server will return a `406 Not Acceptable` status if you say you don't accept whatever type the file is. Of course, it might send the file to you anyway. You can do this on your `HEAD` request as well.
- When using a recent version of LWP::UserAgent, you can use a handler subroutine to abort the rest of the request after the headers and before the content body.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # Runs once the response headers have arrived, before any body is read.
    $ua->add_handler( response_header => sub {
        my ($response, $ua, $h) = @_;
        # content_type returns the bare media type, without any charset parameter
        die "Not HTML" unless $response->content_type eq 'text/html';
    });

    my $url = "http://example.com/foo";
    my $html;

    # Ask politely with HEAD first, then fetch the body with GET.
    my $head_response = $ua->head($url, Accept => "text/html");
    if ($head_response->is_success) {
        my $get_response = $ua->get($url, Accept => "text/html");
        if ($get_response->is_success) {
            $html = $get_response->content;
        }
    }
See the Handlers section of the LWP::UserAgent documentation for details on handlers.
I haven't caught the exception thrown or made sure to deal with the 406 responses carefully here. I leave that as an exercise for the reader.
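The two loose ends mentioned above could be tied up along these lines. This is only a sketch under some assumptions: `html_body_or_undef` is a hypothetical helper name, and it relies on my understanding that LWP traps a `die` raised inside a `response_header` handler and records it in the `Client-Aborted` (and `X-Died`) response headers rather than letting it propagate, so the response can still report success with an empty body. It also skips the separate `HEAD` request and relies on the `Accept` header plus the handler.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Hypothetical helper: given a finished response, return the HTML body,
# or undef if the server refused, the request failed, or the handler aborted.
sub html_body_or_undef {
    my ($response) = @_;
    return if $response->code == 406;               # server refused our Accept header
    return unless $response->is_success;            # any other failure
    return if $response->header('Client-Aborted');  # assumed marker of a handler die
    return $response->decoded_content;
}

my $ua = LWP::UserAgent->new;
$ua->add_handler(
    response_header => sub {
        my ($response, $ua, $h) = @_;
        die "Not HTML\n" unless $response->content_type eq 'text/html';
    },
);

# Only touches the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $html = html_body_or_undef($ua->get($url, Accept => 'text/html'));
    print defined $html
        ? length($html) . " characters of HTML\n"
        : "Skipped $url\n";
}
```

Keeping the decision in a function that takes a response object makes it easy to check against hand-built `HTTP::Response` objects without contacting a server.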

- I believe some servers might send you the content of the file on a `HEAD` request as well. You probably also need to check for `405 Method Not Allowed` responses on the `HEAD` request and see if you can submit the `GET` request anyway if the `HEAD` fails with that response. – zostay Jul 30 '12 at 15:31
- `HEAD` is not reliable, and it seems to me that anything that "properly" handles a `HEAD` request will also "properly" handle an `Accept` header. I would skip the HEAD request and use only the other two mechanisms (`Accept` and callback). – ikegami Jul 30 '12 at 15:35
- The add_handler() thing does exactly what I was looking for. Thanks ! – mbonnin Jul 30 '12 at 16:43
You can use a HEAD request to query the URI's header info. If the server responds to HEAD requests, you'll get everything that a GET would have returned, except for that pesky body.
You can then decide what to do based on the MIME type.
Otherwise, you'll have to rely on the file's extension before you request it.
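A minimal sketch of that decision, assuming the Content-Type value (if any) has already been obtained from a HEAD response. The function name `probably_html` and the extension list are made up for illustration, and the list is far from exhaustive.

```perl
use strict;
use warnings;
use URI;

# Illustrative set of extensions treated as binary; not exhaustive.
my %BINARY_EXT = map { $_ => 1 } qw(jpg jpeg png gif pdf zip gz exe mp3 mp4);

# Hypothetical helper: returns 1 if the URL is worth fetching as HTML, else 0.
# $content_type is the Content-Type header value from a HEAD response,
# or undef if the server gave none.
sub probably_html {
    my ($url, $content_type) = @_;

    # Prefer the MIME type when the server supplied one.
    if (defined $content_type && length $content_type) {
        return $content_type =~ m{^\s*text/html\b}i ? 1 : 0;
    }

    # Otherwise fall back to the extension of the URL path, if there is one.
    my $path = URI->new($url)->path;
    my ($ext) = $path =~ m{\.([A-Za-z0-9]+)$};
    return 1 unless defined $ext;          # no extension: give it a chance
    return $BINARY_EXT{lc $ext} ? 0 : 1;
}
```

The extension fallback is only a guess: a URL without an extension, such as a CGI script, can still return binary data.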

- This will indeed work but is a bit suboptimal as it doubles the number of requests in the nominal case. I get very few binary links so I would prefer abort these ones rather than adding a test for each link. – mbonnin Jul 30 '12 at 15:20
- So why can't you determine file type from file extension before issuing the GET? – Len Jaffe Jul 30 '12 at 16:09
- well... I have some cases where there isn't a single extension. Especially cgi scripts that send binary data... So this is not really reliable. – mbonnin Jul 30 '12 at 16:14
- what tag ? What I'm doing is a bot that examines the contents of web pages it encounters. I have no control over what type of links it might come into. – mbonnin Jul 30 '12 at 16:42
- yea but links are out of context so I don't have any or . Anyway, the add_handler() thing is working fine :-) – mbonnin Jul 30 '12 at 16:57
If you are using the minimal `LWP::Simple` interface to LWP, then its `head` function, called in list context, returns the content type as the first element of a list.
So you can write
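    use strict;
    use warnings;
    use LWP::Simple;

    for my $url ('http://www.bbc.co.uk') {
        # In list context, head() returns the Content-Type header value first.
        # It is undefined if the request failed and may carry a charset suffix.
        my ($ctype) = head($url);
        next unless defined $ctype && $ctype =~ m{^text/html\b}i;
        my $content = get($url);
    }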
