Essentially, I'm working on a service that will help us determine when one of the APIs we depend on goes down. Every API occasionally returns a random 500 or some other transient error, so we don't want to alert the world every time we see one. I'm trying to think of the best way to determine whether there has been a spike in errors from a particular provider recently.
Suppose I have a service set up that tracks the number of errors each provider has returned recently, and a daemon or cron job that periodically goes over those numbers and sends an alert if a service's errors have spiked. How would that daemon decide whether a given service is returning an unusually high number of errors?
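To make that concrete, here is a minimal sketch of the kind of daemon I have in mind. Everything in it is hypothetical: I'm assuming the tracking service writes rows into an `api_errors` table with a provider name and a timestamp, and the check interval and window are just placeholder numbers.

```python
import sqlite3
import time

# Placeholder values -- the real store and intervals would differ.
DB_PATH = "errors.db"     # wherever the tracking service records errors
CHECK_INTERVAL = 60       # run the check every minute
WINDOW = 5 * 60           # only consider errors from the last 5 minutes

def recent_error_count(conn, provider, now):
    """Number of errors logged for a provider in the last WINDOW seconds."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM api_errors WHERE provider = ? AND ts >= ?",
        (provider, now - WINDOW),
    ).fetchone()
    return count

def send_alert(provider, count):
    # Stand-in for email/pager/whatever we end up using.
    print(f"ALERT: {provider} returned {count} errors in the last {WINDOW}s")

def looks_like_a_spike(provider, count):
    # ??? -- this is exactly the decision I don't know how to make well.
    return False

def check_providers(conn, providers):
    now = time.time()
    for provider in providers:
        count = recent_error_count(conn, provider, now)
        if looks_like_a_spike(provider, count):
            send_alert(provider, count)

if __name__ == "__main__":
    while True:
        with sqlite3.connect(DB_PATH) as conn:
            check_providers(conn, ["provider_a", "provider_b"])
        time.sleep(CHECK_INTERVAL)
```

The open question is what goes inside `looks_like_a_spike`.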
The simplest approach would be to set a hard limit on the error count and send an alert whenever the count goes above that limit, roughly as in the snippet below. But I have a gut feeling this is deceptively simple: it looks easy but ends up being complex. My main concern is choosing the limit. How do I pick a good one, and how do I make it scale with increased traffic?
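In code, that naive version is just this (the limit is a number I'd have to pull out of thin air, which is exactly what worries me):

```python
ERROR_LIMIT = 50  # arbitrary -- how do I pick this, and what happens when traffic doubles?

def looks_like_a_spike(provider, count):
    """Naive check: flag any provider whose recent error count exceeds a fixed limit."""
    return count > ERROR_LIMIT
```

A fixed number like that stops meaning much once traffic grows and 50 errors in five minutes becomes normal background noise.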
Has anyone solved this problem in the past and found a solution that works very well? Are there any well-known algorithms for this? One preference that I would have for a solution: the less data I have to track the better.