Essentially, I'm working on a service that will help us determine when one of the APIs we depend on goes down. Every API occasionally returns a random 500 or some other transient error, so we don't want to alert the world every time we see one. I'm trying to think of the best way to determine whether there has been a spike in errors from a particular provider recently.
Suppose I have a service set up that tracks the number of errors each provider has returned recently, and a daemon or cron job that periodically goes over those numbers and sends an alert if a service's errors have spiked. How would that daemon decide whether a given service is returning an unusually high number of errors?
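To make that concrete, here is a minimal sketch of the kind of daemon I have in mind. Everything in it is hypothetical: I'm assuming the tracking service writes rows into an `api_errors` table with a provider name and a timestamp, and the check interval and window are just placeholder numbers.

```python
import sqlite3
import time

# Placeholder values -- the real store and intervals would differ.
DB_PATH = "errors.db"     # wherever the tracking service records errors
CHECK_INTERVAL = 60       # run the check every minute
WINDOW = 5 * 60           # only consider errors from the last 5 minutes

def recent_error_count(conn, provider, now):
    """Number of errors logged for a provider in the last WINDOW seconds."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM api_errors WHERE provider = ? AND ts >= ?",
        (provider, now - WINDOW),
    ).fetchone()
    return count

def send_alert(provider, count):
    # Stand-in for email/pager/whatever we end up using.
    print(f"ALERT: {provider} returned {count} errors in the last {WINDOW}s")

def looks_like_a_spike(provider, count):
    # ??? -- this is exactly the decision I don't know how to make well.
    return False

def check_providers(conn, providers):
    now = time.time()
    for provider in providers:
        count = recent_error_count(conn, provider, now)
        if looks_like_a_spike(provider, count):
            send_alert(provider, count)

if __name__ == "__main__":
    while True:
        with sqlite3.connect(DB_PATH) as conn:
            check_providers(conn, ["provider_a", "provider_b"])
        time.sleep(CHECK_INTERVAL)
```

The open question is what goes inside `looks_like_a_spike`.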
The simplest approach would be to set a hard limit on the error count and send an alert whenever the count goes above that limit, roughly as in the snippet below. But I have a gut feeling this is deceptively simple: it looks easy but ends up being complex. My main concern is choosing the limit. How do I pick a good one, and how do I make it scale with increased traffic?
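In code, that naive version is just this (the limit is a number I'd have to pull out of thin air, which is exactly what worries me):

```python
ERROR_LIMIT = 50  # arbitrary -- how do I pick this, and what happens when traffic doubles?

def looks_like_a_spike(provider, count):
    """Naive check: flag any provider whose recent error count exceeds a fixed limit."""
    return count > ERROR_LIMIT
```

A fixed number like that stops meaning much once traffic grows and 50 errors in five minutes becomes normal background noise.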
Has anyone solved this problem in the past and found a solution that works very well? Are there any well-known algorithms for this? One preference that I would have for a solution: the less data I have to track the better.