- shinken 2.0.3
- nrpe 2.15
We are using nsca to perform passive checks.
define service {
    name salt-service
    register 0
    active_checks_enabled 0
    passive_checks_enabled 1
    check_freshness 1
    freshness_threshold 600
    max_check_attempts 2
    check_interval 5
    retry_interval 3
}
define service {
    use salt-service
    service_description syncthing_procs-2
    host_name x
    check_command check_nrpe!syncthing_procs!10
    display_name Syncthing Procs
}
Although freshness_threshold is set to 10 minutes (600 seconds), some passive checks are reported as stale with a very different threshold:
Oct 6 09:52:36 x shinken: [Tue Oct 6 09:52:35 2015] Warning : The results of service 'syncthing_procs-2' on host 'x' are stale by 0d 0h 10m 16s (threshold=16714d 9h 42m 35s). I'm forcing an immediate check of the service.
Where does threshold=16714d 9h 42m 35s come from when I set it to 10 minutes in the config file? I have checked that the system time on the Shinken VM and on host 'x' is the same.
A lot of services go stale like that. As you can see, once a passive check is stale, check_nrpe is used to perform an active check. The problem is that we now have a large number of nrpe processes that seem to be hanging:
nagios 31404 1 0 Sep18 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 31727 1 0 Oct01 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 31732 1 0 Oct01 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 32148 1 0 Sep30 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 32157 1 0 Sep30 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
I have pasted only a few of them. There are actually more than 200 such processes.
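For anyone who wants to reproduce the count, here is a rough sketch of how such processes can be tallied. It assumes `ps -ef` style input, where the third column is the parent PID, and it only counts nrpe processes whose parent is PID 1, like the ones listed above. The helper name count_orphaned_nrpe is made up for this post; it is not part of nrpe or Shinken.

```shell
# Count nrpe processes whose parent PID is 1, reading `ps -ef` style lines from stdin.
# count_orphaned_nrpe is a hypothetical helper name used only for illustration.
count_orphaned_nrpe() {
    awk '$3 == 1 && $0 ~ /\/usr\/sbin\/nrpe/ { n++ } END { print n + 0 }'
}

# Example using two of the lines from the listing above
sample='nagios 31404 1 0 Sep18 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 31727 1 0 Oct01 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d'
printf '%s\n' "$sample" | count_orphaned_nrpe
```

On a live host the same function can be fed directly with `ps -ef | count_orphaned_nrpe`.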
So, besides the wrong threshold, I have another question: why are there so many nrpe processes left over? I know that a new process is forked to perform an active check, but it should disappear once the check is done, right?
Ah, I found the answer to the first question.
Where does threshold=16714d 9h 42m 35s come from when I set it to 10 minutes in the config file?
It looks like there is a slight difference between Shinken and Nagios here. The value is the Unix epoch time expressed as days/hours/minutes/seconds, not a duration. The day count matches the number of days since the epoch:
expr $(date +%s) / 3600 / 24
16714
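As a cross-check, the full value from the log line can be converted back to seconds and then to a date. This is only a sketch: the variable names are mine, and the date call tries GNU syntax first and falls back to BSD syntax. Assuming the log timestamps are in UTC, the result is 09:42:35 on Oct 6 2015, which is exactly 600 seconds before the log entry at 09:52:35. That would suggest Shinken is printing the cutoff time (now minus freshness_threshold) rather than the threshold itself, but I have not confirmed this in the Shinken source.

```shell
# Components copied from the log line: threshold=16714d 9h 42m 35s
days=16714; hours=9; minutes=42; seconds=35

# Convert back to seconds since the Unix epoch
epoch=$(( days * 86400 + hours * 3600 + minutes * 60 + seconds ))
echo "$epoch"    # 1444124555

# Render as a UTC date: GNU date first, BSD date as a fallback
date -u -d "@$epoch" 2>/dev/null || date -u -r "$epoch"
```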