Reliability of using HEAD Request to Check Web Page Status

Question

I've been testing a small app I've written that basically does an HTTP HEAD request to check whether a page exists, redirects, etc. I've noticed that some pages respond differently to HEAD than to GET requests. For example:

curl -I http://www.youtube.com/bbcpersian

returns a 404, even though the page is definitely there. Some (quite major) sites even return 500 errors in response to a HEAD, which I'm guessing isn't deliberate.

So my questions are:

Is there any good reason why certain sites (or pages within sites) would behave like this, other than configuration issues or a webmaster wanting to block bots?
If I can't rely on a HEAD request, am I just left with doing a GET and aborting the request once I have the headers? That feels a bit "wrong" …

While the number of pages that behave like this is small in percentage terms, each false positive is ultimately investigated manually, which results in a lot of wasted effort.

score 8 · Accepted Answer · answered Nov 02 '11 at 21:51

After some time and much more investigation, I can answer my own questions:

A lot of sites "in the wild" respond incorrectly to HEAD requests. I've had suggestions that some webmasters configure their sites to respond with anything but a 200 to HEAD requests because they associate HEAD requests with bad bots. I can't validate the reasoning, but I can say that a large number of sites (or pages on sites, see my original point about YouTube) respond incorrectly to a HEAD request.
GET is the only reliable way to check that a page really exists (or isn't redirecting, etc.).
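
One way to act on this without paying for a full download on every URL is to try HEAD first and only confirm a non-2xx answer with a GET whose body is never read. The following is a minimal sketch of that idea using Python's standard urllib, not the code from my app; the function names fetch_status and check_page and the 10-second timeout are made up for illustration. Note that urllib follows redirects on its own, so the code returned is that of the final response, and connection-level failures (DNS errors, timeouts) are left to propagate to the caller.

```python
import urllib.error
import urllib.request


def fetch_status(url, method, timeout=10.0):
    """Return the HTTP status code of the final response for `url`.

    urlopen() returns once the headers have arrived. The body is never
    read here: leaving the `with` block closes the connection.
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises 4xx/5xx responses as exceptions; keep the code.
        code = error.code
        error.close()
        return code


def check_page(url, timeout=10.0):
    """Try HEAD first and confirm any non-2xx answer with a GET."""
    status = fetch_status(url, "HEAD", timeout)
    if 200 <= status < 300:
        return status
    # Some servers mishandle HEAD, so a bad answer is re-checked with GET.
    return fetch_status(url, "GET", timeout)
```

Calling check_page("http://www.youtube.com/bbcpersian") would then report whatever the server says to a GET whenever its HEAD answer looks like an error. This still treats a 2xx reply to HEAD as trustworthy, which is an assumption; if that turns out to be wrong for the sites being checked, calling fetch_status(url, "GET") directly is the safer option.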

score 1 · Answer 2 · answered May 31 '19 at 06:43


The URL you are trying, http://www.youtube.com/bbcpersian, is not the correct URL, which is why it returns a 404.

The correct URL is https://www.youtube.com/user/BBCPersian, and it returns a 200.


Jyotsana
