3

Anyone know of a way to monitor e-mail alerts scalably?

For many of my on-site services, I have them e-mail me on success (and failure) of critical tasks. I have them e-mail on success because sometimes the failure is of a kind that prevents the service from e-mailing a failure alert at all.

Unfortunately this doesn't scale: I now get so many alerts that I don't really monitor them, but I can't afford to alert only on failure because that has been too unreliable in the past.

What I would ideally like is a cloud service (or mailbox), something similar to Pingdom, that I can send/forward these alerts to, and that will e-mail/SMS me when it gets a failure alert or when expected success alerts go missing.

Anyone have any ideas?

voretaq7
Dom

1 Answer

10

What you're proposing is to effectively re-implement your monitoring system (by feeding the current system's alerts into another monitoring system that's smart enough to know something is wrong if it's not constantly reassured that everything is fine).

This almost certainly is not what you need. What you need is a combination of on-site and off-site monitoring that will reliably send you failure alerts when something fails (from the internal system normally, or the external system if for some reason the internal system has failed).


Please bear in mind the following monitoring systems axiom:

There is no good reason to alert on success.

Alerting on success is the most common amateur misconfiguration of a monitoring system.
A monitoring system should only alert you about things that require action.

Success, by definition, is not an event requiring action, so no alert should be generated.
The absence of success is by definition "failure", so an actionable failure alert should be generated.

Sending "everything is fine" status messages eventually trains people to ignore messages from the monitoring system (because no action is required most of the time). You want monitoring alerts to be shocking events that galvanize people into action, not routine nuisances that are deleted from their inbox out of muscle memory.
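The rule that "the absence of success is failure" can be sketched as a small dead-man's-switch check. This is only an illustration, not a description of any particular monitoring product: the job names, the time windows, and the `overdue_jobs` helper are all hypothetical, and how the last-success timestamps get recorded is left open.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical jobs and the longest silence tolerated between successful runs.
MAX_SILENCE = {
    "nightly-backup": timedelta(hours=26),
    "log-rotation": timedelta(hours=2),
}


def overdue_jobs(last_success, now, max_silence=MAX_SILENCE):
    """Return the sorted names of jobs whose last success is missing or too old."""
    overdue = []
    for job, limit in max_silence.items():
        seen = last_success.get(job)
        # A job that has never reported success is treated as failed.
        if seen is None or now - seen > limit:
            overdue.append(job)
    return sorted(overdue)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_success = {
        "nightly-backup": now - timedelta(hours=30),
        "log-rotation": now - timedelta(minutes=20),
    }
    # Only the overdue jobs produce output; healthy jobs stay silent.
    for job in overdue_jobs(last_success, now):
        print(f"ALERT: no success recorded for {job} within its window")
```

Successes are recorded quietly and only the overdue list turns into alerts, so nobody receives "everything is fine" mail.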

voretaq7
  • 2
    The trick to avoid getting the message on success is to have your tasks write/update some kind of status after a correct run. Then you build a check in your monitoring system to make sure that status file has been recently updated and contains the correct results. – Zoredache Dec 03 '12 at 23:56
  • 1
    I whole-heartedly agree with this answer. I once worked for a company that alerted on both success and failure. They employed a support engineer whose sole responsibility was to respond to the several thousand alerts that were generated every day. I asked them if they weren't worried about becoming desensitized to alerts and they said that they weren't. On several occasions I had to fill in for this support engineer and those were some of the worst work days in my professional life. – joeqwerty Dec 04 '12 at 00:02
  • @Zoredache that's a good general solution - unfortunately it relies on being able to get a message out in the event of failure which Dom implies might be a problem (hence my inside/outside suggestion - which in reality should be coupled with the status-file checking process you describe) – voretaq7 Dec 04 '12 at 00:03
  • Yes, you all sound great in principle, but in practice there are some problems: some of the software, while it writes logs, can't really create a simple "This pass worked" status file that can be checked by something like GFI Max RemoteManagement. Also, the e-mailed logs contain info like how long the pass took, etc., which is very important. Any internal status monitoring software can go bums up, and as all these servers are unattended (small businesses) there's no one to see it. I need an external service that's monitored (cloud / GFI Max). – Dom Dec 04 '12 at 03:45
  • 1
    @Dom None of that invalidates the advice I've given you -- you are trying to patch a bad design by adding more complexity (complexity based on email, which is itself inherently unreliable with no guarantee of timely delivery, or any delivery at all). You should seriously consider redesigning the system to eliminate the inherent "bad" through modifying your logging, parsing the logs, etc. There is a strong culture of *Doing It Right* on Server Fault, and by all appearances you are currently attempting to *Do It Wrong* - if you think we're off the mark update your question to show why... – voretaq7 Dec 04 '12 at 03:52
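For reference, the status-file approach Zoredache describes in the comments could look roughly like the sketch below. It assumes the task (or a wrapper script around it) is able to write a small JSON file after each run; the file layout, the field names, and the `write_status` / `check_status` helpers are hypothetical and not part of any tool mentioned in this thread. The duration is stored in the file so that the information Dom currently gets from the e-mailed logs is not lost.

```python
import json
import os
import tempfile
import time


def write_status(path, ok, duration_s):
    """Called by the task (or its wrapper) after a run to record the outcome."""
    payload = {"ok": ok, "duration_s": duration_s, "finished_at": time.time()}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
    # Rename into place so the checker never reads a half-written file.
    os.replace(tmp, path)


def check_status(path, max_age_s, now=None):
    """Return (healthy, reason) for use by a monitoring check."""
    now = time.time() if now is None else now
    try:
        with open(path) as fh:
            status = json.load(fh)
    except (OSError, ValueError) as exc:
        return False, f"status unreadable: {exc}"
    if not isinstance(status, dict):
        return False, "status malformed"
    age = now - status.get("finished_at", 0)
    if age > max_age_s:
        # A stale file means the task has not succeeded recently.
        return False, f"last run finished {age:.0f}s ago"
    if not status.get("ok"):
        return False, "last run reported failure"
    return True, "ok"
```

A missing, stale, or failed status all map to the same unhealthy result, so the check raises an alert even when the task itself was unable to send one.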