0

We have a pretty simple setup where I get text messages when there is a system problem. It's nothing fancy: my logging class sends an email to my phone number for alert-level messages. It works well enough, but it has one major flaw: a small hiccup in the system or site can turn into dozens of rapid-fire text messages, sometimes non-stop until I log into the system and fix the problem.

So I'm looking for pointers on software or services that deal with alerts in a smarter way, perhaps something that sends me at most X alerts within Y minutes. I'm not looking for a full monitoring suite. We already deal with that in-house. I'm only looking to tackle this single problem.

mellowsoon
  • What monitoring system are you currently using? – JakeRobinson Feb 21 '11 at 05:27
  • @Jake - It's hard to say. I'm not the sysadmin, I'm the web guy. I only came here because our sysadmin has been blowing me off on giving me a hand with this problem. I personally use my own home brewed scripts that monitor the site and the back end services it relies on, and I send out alerts through my logging class when something is wrong. I'm not sure what he uses to monitor the network and servers. – mellowsoon Feb 21 '11 at 07:07

2 Answers

5

The answer to such questions is most often Nagios. Its alerting options are as flexible as its monitoring capabilities. Configure it to send only the alerts you want and no more.

John Gardeniers
2

I created a small bash script for you:

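#!/bin/bash

COUNT_FILE="/tmp/count"   # prefix of the per-error-type counter files
TIME_FILE="/tmp/time"     # prefix of the per-error-type window-start files
MAX_SEND=1                # max messages per interval
TIME_INT=300              # interval length in seconds


send () {
        ERROR_TYPE="_$1"
        MESSAGE=$2
        # Create the state files on first use
        [ -e "${TIME_FILE}${ERROR_TYPE}" ] || touch "${TIME_FILE}${ERROR_TYPE}"
        [ -e "${COUNT_FILE}${ERROR_TYPE}" ] || echo 0 > "${COUNT_FILE}${ERROR_TYPE}"
        # The time file's mtime marks the start of the current interval
        # (date -r FILE is a GNU date feature)
        if [ $(( $(date +%s) - $(date +%s -r "${TIME_FILE}${ERROR_TYPE}") )) -gt "$TIME_INT" ]
        then
                # Interval expired: start a new one and reset the counter
                COUNT=0
                touch "${TIME_FILE}${ERROR_TYPE}"
        else
                COUNT=$(cat "${COUNT_FILE}${ERROR_TYPE}")
        fi
        if [ "$COUNT" -lt "$MAX_SEND" ]
        then
                echo "$MESSAGE"
                # real send message
        fi
        # Count every attempt, sent or suppressed
        COUNT=$((COUNT + 1))
        echo "$COUNT" > "${COUNT_FILE}${ERROR_TYPE}"
}

send "check_dns" "message"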
ooshro
  • Wow. So I assume it would send a max of 10 messages in 200 seconds? – mellowsoon Feb 21 '11 at 07:02
  • @mellowsoon Changed to one message in 300 seconds – ooshro Feb 21 '11 at 07:22
  • @ooshro, thanks, it works like a charm. Glad you came up with this. I'm the kind of person that would have written a whole system involving databases and messaging queues to solve such a simple problem. – mellowsoon Feb 21 '11 at 07:55
  • @ooshro, I had to take the answer back. This solution suffers from a few major flaws. 1) It crashes if either of the temp files doesn't exist. 2) It doesn't take into account the type of message. I might set it to send only 1 text in 10 minutes, however that should be *per* error type. Meaning I'd still get several texts within 10 minutes if they're different messages. – mellowsoon Feb 24 '11 at 13:52
  • @mellowsoon I updated the script – ooshro Feb 24 '11 at 23:38
  • @ooshro, thanks. Now it's working as expected. The only thing I'm doing a little differently is passing an MD5 hash of the message as ERROR_TYPE to distinguish one type of message from another. – mellowsoon Feb 26 '11 at 20:37
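
The MD5 idea from the last comment could look roughly like the sketch below. The helper name error_key is hypothetical (it is not part of the answer's script), and the sketch assumes md5sum from GNU coreutils is available. It only shows how to turn a message into a stable, filename-safe key; it is not necessarily how mellowsoon implemented it.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): derive a stable,
# filename-safe key from the message text.
# Assumes md5sum from GNU coreutils is installed.
error_key () {
        # md5sum prints "<hash>  -" for stdin; keep only the hash field
        printf '%s' "$1" | md5sum | cut -d' ' -f1
}

# Identical messages map to the same key, different messages to different keys
error_key "DNS lookup failed"
error_key "DNS lookup failed"
error_key "Database unreachable"

# With the send function from the answer, the call would then be:
#   send "$(error_key "$MESSAGE")" "$MESSAGE"
```

Because the key is a plain hex string, it is safe to append to the /tmp/count and /tmp/time file names even when the message itself contains spaces or slashes.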