
I'm looking for a configuration option that keeps a check returning OK from reaching a hard state until max_check_attempts has been reached.

The issue is that we have a service we can bring back online, but it goes right back down a couple of minutes later. When it comes back up, the OK (recovery) notification goes out, and that closes the issue in our ticketing system.

cpuguy83
  • Maybe you should redesign your check so that it does not report success unless the service is really fully running? – b0fh Aug 21 '12 at 22:32
  • Can't. We're monitoring projectors (about 150 of them). A projector reports that it's off, you turn it on, and then it turns itself off again a couple of minutes later. There are no warnings or anything; it's just the projector's self-preservation. – cpuguy83 Aug 21 '12 at 23:14

3 Answers


My recommendation would be to first decide how long after a projector outage a new failure should count as a new outage rather than as part of the previous one.

Depending on how long that window is, I would follow @b0fh's suggestion and redesign the check. If the window is short (several minutes), simply have the check, on an OK result, sleep for X minutes and then rerun itself; only if it passes the second time should it exit with code 0.

If the window is longer than several minutes, a better option is to redesign the check around status caching, so that each new outage or "device up" reading can be compared against the cached state. For this method to be effective, you may need to run the script as a scheduled job on the Nagios host and have it submit passive check results to Nagios.
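The status-caching approach could look roughly like the sketch below. It assumes a scheduled job that somehow reads each projector's power state (the probe itself is out of scope and is simulated here), keeps a small JSON cache of when each device was last seen coming up, and only reports OK once a device has stayed up for a whole window. The names `debounced_state`, `load_cache`, `save_cache`, `proj-01`, `Power`, and the 600-second window are all hypothetical; the `PROCESS_SERVICE_CHECK_RESULT` line follows Nagios's external command format for passive service results.

```python
import json
from pathlib import Path

OK, CRITICAL = 0, 2  # standard Nagios plugin return codes


def debounced_state(cache: dict, device: str, raw_up: bool, now: float, window: float) -> int:
    """Return the state to report for `device`, updating `cache` in place.

    A device that reads "up" is reported OK only after it has stayed up
    continuously for `window` seconds; any "down" reading resets the timer.
    """
    if not raw_up:
        cache.pop(device, None)  # forget when it last came up
        return CRITICAL
    up_since = cache.setdefault(device, now)  # remember the first "up" reading
    return OK if now - up_since >= window else CRITICAL


def passive_result(host: str, service: str, code: int, output: str, now: float) -> str:
    """Format a passive service check result as a Nagios external command."""
    return f"[{int(now)}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"


def load_cache(path: Path) -> dict:
    """Load the per-device "up since" timestamps saved by the previous run."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_cache(path: Path, cache: dict) -> None:
    """Persist the cache so the next scheduled run can compare against it."""
    path.write_text(json.dumps(cache))


if __name__ == "__main__":
    # Simulated (seconds, projector up?) readings for a hypothetical
    # projector "proj-01"; a real job would probe the device instead.
    cache: dict = {}
    for t, up in [(0, False), (60, True), (180, False), (300, True), (960, True)]:
        code = debounced_state(cache, "proj-01", up, t, window=600)
        print(passive_result("proj-01", "Power", code, "on" if up else "off", t))
```

In a real job, `now` would come from `time.time()`, the cache would be loaded at the start of each run and saved at the end, and each formatted line would be written to the external command file configured by `command_file` in nagios.cfg. Note the trade-off: a projector that is genuinely back up stays CRITICAL until the window has passed, which is exactly what keeps the premature OK from closing the ticket.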

Eli

Nagios considers a host or service that changes state as often as you describe to be flapping. You may wish to tune the flap detection settings for this host/service.
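To get a feel for how the thresholds interact, here is a simplified model of flap detection. Nagios keeps the last 21 check results, computes a percent state change over the 20 possible transitions, starts flapping when that percentage rises above the high threshold, and stops when it falls below the low threshold; the per-service knobs are `flap_detection_enabled`, `low_flap_threshold`, and `high_flap_threshold`. This sketch is an assumption-laden approximation: real Nagios weights recent changes more heavily than older ones, and this code does not. `FlapDetector` and the 5/20 threshold values are illustrative only.

```python
from collections import deque

HISTORY = 21  # number of recent check results considered


def percent_state_change(history) -> float:
    """Unweighted percentage of state changes across the possible transitions."""
    states = list(history)
    changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return 100.0 * changes / (HISTORY - 1)  # 20 possible transitions


class FlapDetector:
    """Hysteresis between a low and a high percent-state-change threshold."""

    def __init__(self, low: float, high: float):
        self.low, self.high = low, high
        self.history = deque(maxlen=HISTORY)  # oldest results drop off
        self.flapping = False

    def record(self, state: int) -> bool:
        """Add a check result (0 = OK, 2 = CRITICAL, ...); return flapping status."""
        self.history.append(state)
        pct = percent_state_change(self.history)
        if not self.flapping and pct > self.high:
            self.flapping = True  # too many changes: start flapping
        elif self.flapping and pct < self.low:
            self.flapping = False  # settled down: stop flapping
        return self.flapping


if __name__ == "__main__":
    det = FlapDetector(low=5.0, high=20.0)
    # A projector that is switched on and then turns itself off, repeatedly.
    for state in [0] * 21 + [2, 0, 2, 0, 2, 0]:
        print(state, det.record(state))
```

The gap between the two thresholds is what prevents the flapping status itself from toggling on every check; while a service is flapping, Nagios suppresses its normal problem and recovery notifications.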

Michael Hampton

Use the check_command directive to override the default host check, and point it at a custom check script that performs the desired number of checks before declaring a state. Even easier, again using check_command, define a new check-host-alive command that sends multiple pings before declaring the host down. The default is one ping.
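A custom check that performs several checks before declaring OK could be a small wrapper around an existing plugin, sketched below under these assumptions: the wrapped plugin follows the usual exit-code convention (0 OK, 2 CRITICAL, 3 UNKNOWN), any non-OK result is passed straight through so Nagios's own soft/hard retry logic still applies, and an OK is only reported if it holds on every attempt. The script name `confirm_ok.py`, the functions, and the default of 3 attempts 20 seconds apart are hypothetical.

```python
import subprocess
import sys
import time

OK, UNKNOWN = 0, 3  # standard Nagios plugin return codes


def run_plugin(argv):
    """Run a plugin command; return (exit code, first line of its output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:  # e.g. plugin path not found
        return UNKNOWN, f"UNKNOWN - cannot run plugin: {exc}"
    lines = proc.stdout.splitlines()
    return proc.returncode, lines[0] if lines else ""


def confirm_ok(check, attempts=3, delay=20.0, sleep=time.sleep):
    """Report OK only if `check` returns OK on every one of `attempts` runs."""
    code, output = check()
    for _ in range(attempts - 1):
        if code != OK:
            break  # non-OK results are reported immediately
        sleep(delay)  # wait, then confirm the OK still holds
        code, output = check()
    return code, output


def main(argv):
    """Usage (hypothetical): confirm_ok.py PLUGIN [PLUGIN ARGS...]"""
    if not argv:
        print("UNKNOWN - usage: confirm_ok.py PLUGIN [ARGS...]")
        return UNKNOWN
    code, output = confirm_ok(lambda: run_plugin(argv))
    print(output)
    return code


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Keep attempts times delay comfortably below the check timeout Nagios enforces (`service_check_timeout` or `host_check_timeout` in nagios.cfg), otherwise Nagios will kill the wrapper before it answers; for windows longer than that, a cached or passive approach fits better.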

Senthil