Is there a reliable way to determine if a URL returns content or an error?

Question

I've got a business requirement to run through a list of URLs and identify the ones that return an error. I've written a simple script that fetches only the headers for each URL, since I don't care about the content. I just want to know if there's an error fetching the content. In some cases, my script reports a 503 error even though the URL returns content. Here's one example.

$ curl --head https://www.eia.gov/consumption/
HTTP/1.1 503 Service Unavailable
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 175
Expires: Fri, 05 Jan 2018 21:32:47 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Fri, 05 Jan 2018 21:32:47 GMT
Connection: keep-alive

Running the same curl command without the "--head" part returns a page of HTML and it's not an error page. It's relevant content. So, that 503 error is misleading.

Is this a misconfigured web server returning an incorrect response header or am I missing something?

The real question is this: Is there a reliable way to determine if a URL returns valid content or if it returns an error? The presence of HTML is useful in this case but I wouldn't count on getting HTML back meaning there is not an error. The 404 error is the classic case of getting a page of HTML but the error code tells me that the page wasn't found.

score 3 · Accepted Answer · answered Jan 05 '18 at 22:21

The --head option makes curl send an actual HTTP HEAD request. Some servers might not honor this or might not route it the same as an HTTP GET request such as a browser would send. Using the -i option will print response headers but still send a GET request. This will also return the entire body of the response. You could cut this down to the first line containing the protocol version and response status only with the head command like so:

$ curl -si https://www.eia.gov/consumption/ | head -n 1
HTTP/1.1 200 OK

(The -s option silences the progress meter that curl would otherwise print when its output is piped to another process. The -n option for head sets the number of lines to return.)

How to determine success depends on your definition of "valid". In HTTP, status codes in the 200 range mean success and codes in the 300 range mean redirection, so neither range indicates an error. If you want to detect based on that, you could use grep like so:

$ curl -si https://www.eia.gov/consumption/ | head -n 1 | grep -E '^HTTP/[0-9.]+ [23][0-9][0-9]'

This uses a regular expression to match any status code starting with 2 or 3. The pattern accepts any protocol version rather than a specific one, since the version may not always be the same (over HTTP/2, for example, curl prints "HTTP/2" with no minor version).
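If you need more than a match/no-match answer, the same check can be sketched as a small shell function. This is only an illustration: the function name classify_status_line and the labels it prints are made up for this example, and which ranges count as an "error" is an assumption you may want to change.

```shell
#!/bin/sh
# Hypothetical helper: classify an HTTP status line such as "HTTP/1.1 200 OK".
# Prints one of: ok, redirect, error, unknown.
classify_status_line() {
    # Header lines end in CRLF, so drop the carriage return first.
    # The status code is the second whitespace-separated field.
    code=$(printf '%s\n' "$1" | tr -d '\r' | awk '{print $2}')
    case "$code" in
        2[0-9][0-9]) echo "ok" ;;
        3[0-9][0-9]) echo "redirect" ;;
        4[0-9][0-9]|5[0-9][0-9]) echo "error" ;;
        # Anything else, including an empty line, is left undecided.
        *) echo "unknown" ;;
    esac
}
```

You would call it as classify_status_line "$(curl -si https://www.eia.gov/consumption/ | head -n 1)". For the "HTTP/1.1 200 OK" line shown earlier it prints ok.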

Once you have the line returned by curl and head, there are endless possibilities to process, format, and return the results depending on what you actually need.
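As one example of such processing, here is a rough sketch for the original use case of running through a list of URLs. The function names fetch_status_line and check_urls are invented for this example, as are the 10-second timeout and the convention of reporting 000 when no response comes back at all.

```shell
#!/bin/sh
# Fetch only the status line of a GET response for the given URL.
# --max-time keeps one slow host from stalling the whole run
# (10 seconds is an arbitrary choice for this sketch).
fetch_status_line() {
    curl -si --max-time 10 "$1" < /dev/null | head -n 1 | tr -d '\r'
}

# Read URLs from standard input, one per line, and print "<code> <url>"
# for each. A URL that yields no response is reported with code 000.
check_urls() {
    while IFS= read -r url; do
        # Skip blank lines in the input.
        [ -n "$url" ] || continue
        # The status code is the second field of the status line.
        code=$(fetch_status_line "$url" | awk '{print $2}')
        printf '%s %s\n' "${code:-000}" "$url"
    done
}
```

With the URLs in a file (say urls.txt, a hypothetical name), check_urls < urls.txt prints one line per URL, which you can then filter for codes starting with 4 or 5, or for 000.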

Interesting! So, I can't rely on getting the same response header if I do HEAD vs. GET. That's too bad since my preference is not to fetch the page. Is your experience that GET is closer to what a browser does? — Sol, Jan 05 '18 at 22:46
If you enter a URL in a browser it surely does a GET request. — virullius, Jan 05 '18 at 23:03
