Nagios multiple check attempts for hard OK

Question

I'm looking for a configuration option so that a check returning OK does not reach a hard OK state until max_check_attempts has been reached, the same way a failing check does.

The issue is that we have a service we can bring online that then goes right back down a couple of minutes later. As soon as it comes up, Nagios sends the OK notification, which closes the issue in our ticketing system.

Maybe you should redesign your check so that it does not report success unless the service is really fully running? – b0fh, Aug 21 '12 at 22:32
Can't. I'm monitoring projectors (about 150 of them). A projector reports that it's off, you turn it on, and then it turns itself off a couple of minutes later. No warnings or anything; it's just the projector's self-preservation. – cpuguy83, Aug 21 '12 at 23:14

score 0 · Answer 1 · answered Aug 31 '12 at 01:59

My recommendation would be to first decide how long a projector must stay up after an outage before a later failure counts as a new outage rather than a continuation of the previous one.

Depending on how long that window is, I would follow @b0fh's suggestion and redesign the check. If the window is short (several minutes), have the check sleep for X minutes after an OK result and then rerun; only if it passes the second time should it exit with code 0. If the window is longer than several minutes, a better option is to redesign the check with status caching, so that each result can be compared against the cached outage/up state. For this method to work well, you may need to run the script as a scheduled job on the Nagios host and have it submit passive check results to Nagios.
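A minimal sketch of the status-caching idea, assuming a wrapper script around the existing projector plugin. The state-file layout, the 300-second hold window, and the names `decide`, `load_state` and `save_state` are all hypothetical, not part of Nagios. The wrapper remembers that the device was down and keeps reporting CRITICAL until the raw check has stayed OK for the whole window.

```python
#!/usr/bin/env python3
"""Sketch: hold back OK after an outage until the device has stayed up."""
import json
import subprocess
import sys
import time

OK, CRITICAL, UNKNOWN = 0, 2, 3
HOLD_SECONDS = 300  # assumed "still the same outage" window


def load_state(path):
    """Return the cached state, or an empty dict if none is readable."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (OSError, ValueError):
        return {}


def save_state(path, state):
    with open(path, "w") as fh:
        json.dump(state, fh)


def decide(raw_status, state, now, hold=HOLD_SECONDS):
    """Map the raw plugin status to the status reported to Nagios.

    `state` is updated in place: it records whether an outage was seen
    and when the current run of OK results started.
    """
    if raw_status != OK:
        state["recovering"] = True
        state["up_since"] = None
        return raw_status
    if not state.get("recovering"):
        return OK  # no outage on record, nothing to hold back
    if state.get("up_since") is None:
        state["up_since"] = now
    if now - state["up_since"] >= hold:
        state["recovering"] = False
        return OK
    return CRITICAL  # device is up, but still inside the hold window


def main(argv):
    if len(argv) < 3:
        print("usage: hold_ok.py STATE_FILE CHECK_COMMAND [ARGS...]")
        return UNKNOWN
    state_file, cmd = argv[1], argv[2:]
    result = subprocess.run(cmd, capture_output=True, text=True)
    state = load_state(state_file)
    status = decide(result.returncode, state, time.time())
    save_state(state_file, state)
    if status != result.returncode:
        print("CRITICAL - device is up but not yet stable")
    else:
        print(result.stdout.strip())
    return status


# Act as a plugin only when a state file and a command were supplied.
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv))
```

Each projector would need its own state file, since the cache holds a single device's history. The same `decide` logic could also run from a scheduled job that submits passive results, as suggested above.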

score 0 · Answer 2 · answered Aug 31 '12 at 02:19

Nagios considers a host or service that is acting in the manner you describe to be flapping. You may wish to tweak your flap detection for this host/service.

Michael Hampton

Not quite. A flapping host/service is one that is constantly going back and forth. – cpuguy83 Oct 05 '12 at 14:34

score -2 · Answer 3 · edited Aug 24 '16 at 00:10

Use check_command to override the default host check with a custom check script that performs the desired number of checks before declaring a state. Even easier, again using check_command, define a new check-host-alive that sends multiple pings before declaring the host down. The default is one ping.
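As an illustration of a custom check that runs several times before declaring a state, here is a hedged sketch applied to the OK direction asked about in the question. The function name `confirm`, the attempt count, and the interval are assumptions for the example. It reports OK only if every attempt returns OK, and returns the first non-OK status otherwise.

```python
#!/usr/bin/env python3
"""Sketch: require several consecutive OK results before reporting OK."""
import subprocess
import sys
import time

OK = 0


def confirm(run_check, attempts=3, interval=20.0, sleep=time.sleep):
    """Call `run_check` up to `attempts` times, pausing between calls.

    Returns OK only if every attempt returned OK; otherwise returns the
    first non-OK status immediately.
    """
    for n in range(attempts):
        status = run_check()
        if status != OK:
            return status
        if n < attempts - 1:
            sleep(interval)
    return OK


def main(argv):
    cmd = argv[1:]
    last_output = []

    def run_check():
        result = subprocess.run(cmd, capture_output=True, text=True)
        last_output[:] = [result.stdout.strip()]
        return result.returncode

    status = confirm(run_check)
    print(last_output[0] if last_output else "")
    return status


# Act as a plugin only when a command to wrap was supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The total run time, roughly `(attempts - 1) * interval`, has to stay below the plugin timeout Nagios enforces (`service_check_timeout` in nagios.cfg), so this blocking approach only suits short confirmation windows.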

answered Aug 23 '16 at 11:37

Senthil
