I've been testing a small app I've written that basically does an HTTP HEAD request to check whether a page exists, redirects, etc. I've noticed that some pages respond differently to HEAD requests than to GET requests. For example:
curl -I http://www.youtube.com/bbcpersian
returns a 404, but the page is definitely there. Some (quite major) sites even return 500 errors in response to a HEAD request, which I'm guessing isn't deliberate.
So my questions are:
- Is there any good reason why certain sites (or pages within sites) would behave like this, other than configuration issues or a webmaster wanting to block bots?
- If I can't rely on a HEAD request, am I just left with doing a GET and aborting the request once I have the headers? That feels a bit "wrong"…
While the number of pages that behave like this is small in percentage terms, each false positive is manually investigated, which results in a lot of wasted effort.
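
For illustration, here's a rough sketch of the fallback I'm describing: try HEAD first, and if the server rejects it, retry with a GET that gets aborted as soon as the headers arrive. (This is just a sketch assuming Python and the requests library, not necessarily what my app actually uses; the URL is the example from above.)

```python
import requests

URL = "http://www.youtube.com/bbcpersian"  # example URL from the question

def check_url(url, timeout=10):
    """Try HEAD first; fall back to a GET aborted after the headers."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        # Some servers answer HEAD with 404/405/500 even though GET works,
        # so treat any error status as "HEAD unreliable" and retry with GET.
        if resp.status_code < 400:
            return resp.status_code, resp.url
    except requests.RequestException:
        pass

    # stream=True defers downloading the body, so closing the response
    # immediately means we only ever pay for the headers.
    resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
    status, final_url = resp.status_code, resp.url
    resp.close()
    return status, final_url

if __name__ == "__main__":
    print(check_url(URL))
```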