  • shinken 2.0.3
  • nrpe 2.15

We are using nsca to perform passive checks.

define service {
    name salt-service
    register 0

    active_checks_enabled 0
    passive_checks_enabled 1
    check_freshness 1
    freshness_threshold 600
    max_check_attempts 2
    check_interval 5
    retry_interval 3
}

define service {
    use salt-service
    service_description syncthing_procs-2
    host_name x
    check_command check_nrpe!syncthing_procs!10
    display_name Syncthing Procs
}

Although the freshness_threshold is 10 minutes, there is a case when passive checks are stale:

Oct 6 09:52:36 x shinken: [Tue Oct 6 09:52:35 2015] Warning : The results of service 'syncthing_procs-2' on host 'x' are stale by 0d 0h 10m 16s (threshold=16714d 9h 42m 35s). I'm forcing an immediate check of the service.

Where does threshold=16714d 9h 42m 35s come from when I set it to 10 minutes in the config file? The system time on the Shinken VM and on host 'x' is the same.

Many services go stale like that. As you can see, after a passive check goes stale, check_nrpe is used to perform an active check. The problem is that we now have a large number of nrpe processes that seem to be hanging:

nagios   31404     1  0 Sep18 ?        00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios   31727     1  0 Oct01 ?        00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios   31732     1  0 Oct01 ?        00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios   32148     1  0 Sep30 ?        00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios   32157     1  0 Sep30 ?        00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d

I pasted only a few; there are actually more than 200 processes.

So, besides the wrong threshold, I have another question: why are there so many nrpe processes afterwards? I know that a new process is forked when an active check is performed, but it should disappear once the check is done, right?


Ah, I know the answer to the first question.

Where does threshold=16714d 9h 42m 35s come from when I set it to 10 minutes in the config file?

It looks like Shinken differs slightly from Nagios here: the value it prints is the Unix epoch time expressed in days/hours/minutes/seconds, not the configured threshold.

expr $(date +%s) / 3600 / 24
16714
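As a cross-check, the logged threshold can be converted back into a timestamp. This is a minimal sketch, assuming GNU `date` (for `-d @...`); the helper names `dhms_to_seconds` and `epoch_to_utc` are made up for illustration.

```shell
# Convert a "Dd Hh Mm Ss" duration into a plain number of seconds.
dhms_to_seconds() {
    echo $(( $1 * 86400 + $2 * 3600 + $3 * 60 + $4 ))
}

# Render seconds since the epoch as a UTC timestamp (GNU date assumed).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

secs=$(dhms_to_seconds 16714 9 42 35)   # values taken from the log line
echo "$secs"                            # 1444124555
epoch_to_utc "$secs"                    # 2015-10-06 09:42:35
```

That is 600 seconds before the 09:52:35 in the log line. If the log timestamps are in UTC, the printed "threshold" would be the current time minus freshness_threshold rather than the threshold itself. This is only a reading of the numbers, not something confirmed from the Shinken source.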
quanta
  • What is the state of each `nrpe` process? It seems that some error occurred, so the `nrpe` processes became children of init. – cuonglm Nov 02 '15 at 12:24
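
The state and parent PID asked about here can be read from `ps`. A rough sketch, assuming the procps `ps` found on most Linux systems (for `-C` and `-o`); `count_by_state` and `nrpe_report` are hypothetical helper names.

```shell
# Tally process states by their first letter
# (S = sleeping, D = uninterruptible sleep, Z = zombie, ...).
count_by_state() {
    awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

nrpe_report() {
    # PID, parent PID, state, start time and command of every nrpe process.
    ps -C nrpe -o pid,ppid,stat,lstart,cmd
    # Summary: how many nrpe processes are in each state.
    ps -C nrpe -o stat= | count_by_state
}
```

Running `nrpe_report` on host 'x' prints the full list followed by one line per state. Many entries in state `Z` would suggest children that were never reaped, while many in `S` would suggest processes still waiting on something, such as an open connection.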

1 Answer


It's not possible to tell exactly what went wrong in your case, so here are some thoughts:

We are using nsca to perform passive checks. [...] why are there so many nrpe processes afterwards? I know that a new process is forked when an active check is performed, but it should disappear once the check is done, right?

It seems that nsca is not working properly, so the freshness check falls back to active checks. Make sure that nsca works.

Although the freshness_threshold is 10 minutes, there is a case when passive checks are stale

Or nsca is not configured to send passive results to Shinken.
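
One way to check the nsca path end to end is to submit a single passive result by hand and watch whether Shinken logs it. A sketch, assuming the standard `send_nsca` client with its default tab-delimited input (host, service, return code, plugin output); the server name `shinken.example.com` and the config path are placeholders, and `format_passive_result` and `submit_passive_result` are hypothetical helpers.

```shell
# Build one passive service result in the tab-separated format
# that send_nsca reads on stdin: host, service, return code, output.
format_passive_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Placeholders: adapt to the real Shinken receiver and client config.
SHINKEN_HOST="shinken.example.com"
NSCA_CFG="/etc/send_nsca.cfg"

# Pipe a single result to the NSCA receiver.
submit_passive_result() {
    format_passive_result "$@" | send_nsca -H "$SHINKEN_HOST" -c "$NSCA_CFG"
}

# Example, to be run on host 'x':
# submit_passive_result x syncthing_procs-2 0 "PROCS OK: manual test"
```

If the manual result shows up in Shinken, the transport works and the stale results point at whatever is supposed to submit them on a schedule. If it does not, the receiver side (port, encryption settings, password) is the next thing to check.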

I know that a new process is forked when an active check is performed, but it should disappear once the check is done, right?

Maybe the checks have not finished and the connections are being held open by the other side (Shinken).

HVNSweeting