53

I'm implementing a /_status/ endpoint which does some sanity checks on data in our database.

For example, we are collecting measurements and the status should go "bad" if the latest measurement is over an hour old.

I would like to point Pingdom at this URL to leverage their alerting infrastructure and tell us when something's wrong.

On a "good" status I will serve an HTML page with an HTTP 200 OK status. But what would an appropriate HTTP status code be for "bad"? Or would it be more correct not to convey this information via status code, but via HTML content instead?

Thanks!

Paul M Furley
  • Is your 'bad' status the result of server failure? If so, a 500 might be appropriate – Ken Aug 19 '14 at 17:22
  • Not really, it would normally be the result of a backend processing job failing. It's quite reasonable that the server, database and everything are working perfectly but the data in them is "bad". – Paul M Furley Aug 19 '14 at 17:23
  • Sorry Paul, I should have said 'service failure' rather than 'server failure' – Ken Aug 19 '14 at 17:46
  • 2
    https://tools.ietf.org/html/draft-inadarei-api-health-check-00 A draft RFC has been written to standardise these types of responses. Of interest are the /health endpoint and the Content-Type: application/vnd.health+json – ShaunP Mar 05 '18 at 16:07

4 Answers

71

Well... this is an old question, but I ended up here, so I thought I'd give my two cents: it seems pretty clear that a 2xx should be returned if all is OK.

If health is not OK, I think it should return a 5xx result (4xx talks about the client being at fault in the request; 2xx and 3xx are all successful to some degree).

I think that a 5xx is correct because this is a special request that reports on the state of the whole service. It is also practical: most load balancers offer liveness checks based on response codes, and not all of them can parse a more complex payload (other than perhaps a regular-expression match, which can make the check brittle).

I agree with @Julien that a 500 (specifically) doesn't seem appropriate, and we've decided on 503 Service Unavailable.

503 seems to fit for a couple of reasons:

  • It's a 5xx family result code, which indicates that something is wrong on the server side.
  • It is temporary in nature, indicating that the service may recover.
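
As a rough illustration of the 200-versus-503 approach applied to the question's "latest measurement is over an hour old" rule, here is a minimal Python sketch using only the standard library. The names `MAX_AGE`, `status_for`, `make_status_app` and the `get_latest_measurement` callable are hypothetical and not from the original post; how the latest measurement is actually fetched is left to the caller.

```python
from datetime import datetime, timedelta, timezone
from http import HTTPStatus

# Assumed threshold, taken from the question: data older than an hour is "bad".
MAX_AGE = timedelta(hours=1)


def status_for(latest_measurement: datetime, now: datetime) -> HTTPStatus:
    """Return 200 OK when the newest measurement is fresh, 503 otherwise."""
    age = now - latest_measurement
    return HTTPStatus.OK if age <= MAX_AGE else HTTPStatus.SERVICE_UNAVAILABLE


def make_status_app(get_latest_measurement):
    """Build a WSGI app for the status endpoint.

    get_latest_measurement is a hypothetical callable that returns the
    timezone-aware datetime of the newest measurement in the database.
    """
    def app(environ, start_response):
        status = status_for(get_latest_measurement(), datetime.now(timezone.utc))
        status_line = f"{status.value} {status.phrase}"
        body = (status_line + "\n").encode("utf-8")
        start_response(status_line, [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

A monitor or load balancer that only looks at response codes can then treat anything other than a 2xx from this endpoint as "down".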
Paolo
  • Reading https://tools.ietf.org/html/rfc7231#section-6.6.1 I have the feeling that 503 is rather referring to overload problems. Therefore I would prefer 500 – Michael S Nov 21 '19 at 09:03
20

We just had a similar discussion in our group. We decided for our purposes that the HTTP response codes should be reporting on your server's success or failure to honor the request. For a GET, this would mean whether or not you can respond with the requested resource. In this case, the requested resource is a health report, so as long as you're returning that successfully, it should be a 200 response.

We're returning JSON for our health check, with a top-level "isHealthy" field set to true or false. Our load balancer and other monitors will parse the JSON and use this field to determine if the system is healthy or not.
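
To make the JSON approach concrete, here is a small Python sketch of both sides: building the report and evaluating it in a monitor. Only the top-level "isHealthy" field comes from the description above; the "checks" field and the function names `build_health_report` and `is_healthy` are assumptions made for illustration.

```python
import json


def build_health_report(checks: dict) -> str:
    """Serialize a health report; the system is healthy only if every check passed.

    `checks` maps a check name to True/False. The "checks" field in the
    output is an assumption; only "isHealthy" is described in the answer.
    """
    report = {"isHealthy": all(checks.values()), "checks": checks}
    return json.dumps(report)


def is_healthy(status_code: int, body: str) -> bool:
    """Monitor side: healthy only if the report was served and says so."""
    if status_code != 200:
        # The health report itself could not be served.
        return False
    try:
        report = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(report, dict) and report.get("isHealthy") is True
```

With this split, a non-200 response means "the report could not be produced", while a 200 with "isHealthy": false means "the report was produced and the news is bad".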

If you don't want to parse JSON in your monitors, you could try putting a custom response header to indicate binary health of the system, e.g., System-Health: true or System-Health: false. You might have better luck getting monitors which can check that.
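
A sketch of the header variant in Python, assuming the header name System-Health from the paragraph above and headers represented as (name, value) pairs, as in WSGI. The function names are hypothetical.

```python
def health_headers(healthy: bool) -> list:
    """Server side: the extra response header carrying binary health."""
    return [("System-Health", "true" if healthy else "false")]


def healthy_from_headers(headers) -> bool:
    """Monitor side: read the header, treating a missing header as unhealthy.

    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers:
        if name.lower() == "system-health":
            return value.strip().lower() == "true"
    return False
```

Treating a missing header as unhealthy is a design choice here: it means a misconfigured proxy or an error page will not be mistaken for a healthy service.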

If you really want to use a response code, I would recommend an additional endpoint called something like "health" which returns a "204 No Content" when healthy, and a "404 Not Found" when not healthy. In this case, the resource defined by the URL is, symbolically, the health of your system, and so if it's healthy, you can return a successful response. If it's unhealthy, then its health can't be found, hence the 404.

brianmearns
  • 15
    Had some follow-up discussion on this, and we decided that 4xx errors aren't actually appropriate, because they are meant to indicate a client error, which is not the case here. A 5xx error is more appropriate to indicate unhealthy. This could still be conflated with an error trying to serve the health check, but that's probably ok because it still indicates a problem on the server. I think the ideal situation is a health report in JSON (or other structured data) that is served with a 200 response whether healthy or not, with monitors relying on the contents of the report to determine the health of the system. – brianmearns Jun 29 '17 at 14:53
  • 5
    This reasoning is incorrect. HTTP Status Codes have nothing to do with whether the server is able to process the request or not. The mere fact that a response is being sent back indicates that the server was able to process the request. HTTP Status codes are used to indicate the outcome of the processing. The outcome of processing a health check is either healthy or not. A healthy outcome can be indicated with a 2xx status code. An unhealthy outcome can be indicated with a 5xx. An unhealthy server is unable to process future requests to provide the correct outcome. Hence, the 5xx. – Avin Kavish Aug 24 '19 at 04:23
  • The server defines what its resources mean. It is fine for the server to define it as a client error to ask for health information when the server is unhealthy. 409 Conflict is particularly appropriate. – bwtaylor Oct 28 '19 at 20:23
  • @AvinKavish FWIW you're describing a very simple architecture there. In large systems there are layers of CDNs, caches, load balancers etc between the server and the client. 5xx's are often generated by these intermediate layers - not the server itself. – Ryan Oct 05 '22 at 04:21
  • @brianmearns Could you please update your answer based on the comment? Then it will have better visibility. – Jins Peter Jan 31 '23 at 17:56
4

If your data is 'bad' because there is a service failure (even if that is a backend job failing) then an HTTP 500 seems like a valid response. It indicates that something, somewhere is broken.

It isn't very specific; you're shrugging your shoulders and saying:

The 500 (Internal Server Error) status code indicates that the server encountered an unexpected condition that prevented it from fulfilling the request.

ietf rfc7231

Community
Ken
  • 8
    But, on the other end, you successfully gave the status of the service. So the request itself is successful. 500 indicates a problem for answering the request, not the service as a whole. Which is not the case if you can serve the status successfully. – Julien Jun 01 '16 at 18:15
  • I do not think that 500 is the most appropriate code here. A 500 is used for unhandled server exceptions or non-completable requests. In this case, however, the server is able to complete the healthcheck successfully and no unknown error occurred, so I think a more fine-grained code than 500 should be used. – Tommy Mar 08 '17 at 13:44
  • @Tommy, how do you feel about a 503 response? – ShaunP Mar 05 '18 at 16:15
  • 1
    @ShaunP I would use a 503 in the case that OP's healthcheck script relies on some external thing, like a database, and that database wasn't reachable. (503 is downstream unreachable). Note that a 503 says "try me again later under the same circumstances and I might work right". It isn't a "permanent error"; it is meant for an error that might be transient. – Tommy Mar 05 '18 at 16:53
1

If you ask for health and the server state is not healthy, I'm partial to 409 Conflict, which "Indicates that the request could not be processed because of conflict in the current state of the resource".

Some people might object that if you can respond then the request can be processed, but I disagree. Every error message is a response. The server defines resource semantics. If you ask for the good news resource and the server responds "here is bad news", it didn't give you what it defines to have offered at that resource.

In practice, it's much easier to say 2**="up" and 4**="down", pipe request counts into an availability metric, and have a load balancer remove the server from its pool based on the response code. Coming up with ways to argue that "hey, we told you something, so 200 OK" just seems like missing the forest for the trees to me.
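
A minimal Python sketch of that idea, assuming the monitor has already collected the status codes of its recent health checks. The names `is_up` and `availability` are hypothetical, and treating every non-2xx code as "down" is a simplification of the 2**/4** rule above.

```python
def is_up(status_code: int) -> bool:
    """Treat any 2xx response as "up" and everything else as "down"."""
    return 200 <= status_code < 300


def availability(status_codes):
    """Fraction of health checks that reported "up", or None with no samples."""
    codes = list(status_codes)
    if not codes:
        return None
    return sum(1 for code in codes if is_up(code)) / len(codes)
```

A load balancer applies the same `is_up` test to each individual probe when deciding whether to keep a server in its pool.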

bwtaylor