I have a business requirement to run through a list of URLs and identify the ones that return an error. I've written a simple script that fetches only the headers for each URL, since I don't care about the content; I just want to know whether there's an error fetching it. In some cases, the header request reports a 503 error even though the URL does return content. Here's one example.
$ curl --head https://www.eia.gov/consumption/
HTTP/1.1 503 Service Unavailable
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 175
Expires: Fri, 05 Jan 2018 21:32:47 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Fri, 05 Jan 2018 21:32:47 GMT
Connection: keep-alive
Running the same curl command without the "--head" option (a GET request instead of a HEAD request) returns a page of HTML, and it's not an error page. It's the relevant content. So that 503 error is misleading.
Is this a misconfigured web server returning an incorrect status code for HEAD requests, or am I missing something?
The real question is this: Is there a reliable way to determine whether a URL returns valid content or an error? The presence of HTML is useful in this case, but I wouldn't rely on it: getting HTML back doesn't mean there was no error. A 404 is the classic case, where I get a page of HTML but the status code tells me that the page wasn't found.
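To make the question concrete, here is a minimal sketch of one way such a check could be structured. It assumes Python and its standard urllib module (the question doesn't say what the script is written in), and the names fetch_status and check_url are hypothetical. The idea is to keep using HEAD, but to confirm any apparent failure with a GET before flagging the URL, and to judge the result by the status code rather than by whether HTML came back.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, method="HEAD", timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method=method)
    try:
        # Only the status line and headers are needed; the body is never read.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx, but the status code is still available.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # DNS failure, refused connection, timeout, TLS error, and so on.
        return None


def check_url(url, fetch=fetch_status):
    """Hypothetical helper: try HEAD first, confirm any failure with GET."""
    status = fetch(url, "HEAD")
    if status is None or status >= 400:
        # Some servers answer HEAD differently from GET, so retry before flagging.
        status = fetch(url, "GET")
    ok = status is not None and status < 400
    return ok, status


if __name__ == "__main__":
    # The URL from the example above; the result depends on the live server.
    print(check_url("https://www.eia.gov/consumption/"))
```

With this approach, a URL like the one above would be flagged only if the GET also fails, while a 404 is still reported as an error even though it comes with a page of HTML. Whether the status code from GET can always be trusted is a separate assumption: a server can return 200 with an error page, and this sketch would not detect that.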