I have the following service set up for Nagios:
define service {
    hostgroup_name          LNX
    service_description     /tmp Disk Usage
    check_command           check_nrpe!check_disk!-a '-w 20% -c 10% -p /tmp'
    check_interval          1
    max_check_attempts      3
    retry_interval          1
    check_period            24x7
    notification_interval   2
    notification_period     24x7
    notification_options    c,r,w
    notifications_enabled   0
    contact_groups          devops
}
Which ties to the following command:
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -u -t 60 -c $ARG1$ $ARG2$
}
So in the end what's being executed (and its output when run on command line) is:
$: /usr/local/nagios/libexec/check_nrpe -H <my host> -u -t 60 -c check_disk -a '-w 20% -c 10% -p /tmp'
DISK OK - free space: /tmp 4785 MB (97% inode=99%);| /tmp=124MB;3928;4419;0;4910
Following this with echo $? yields 0, meaning OK/success.
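To make the expansion explicit, here is a minimal sketch of how the check_command string is split on "!" and substituted into the command_line. It only imitates the macro substitution in plain shell and is not how Nagios implements it internally; the host address is a placeholder and the shell variable names are stand-ins for the Nagios macros.

```shell
#!/bin/sh
# Stand-ins for the Nagios macros. USER1 matches the path used in the post;
# HOSTADDRESS is a placeholder documentation address, not a real host.
USER1="/usr/local/nagios/libexec"
HOSTADDRESS="192.0.2.10"

# The check_command value from the service definition.
check_command="check_nrpe!check_disk!-a '-w 20% -c 10% -p /tmp'"

# Split on "!" into the command name, $ARG1$ and $ARG2$.
old_ifs=$IFS
IFS='!'
set -f                    # disable globbing while splitting
set -- $check_command
set +f
IFS=$old_ifs

command_name=$1
ARG1=$2
ARG2=$3

# Substitute into the command_line template from the command definition.
expanded="$USER1/$command_name -H $HOSTADDRESS -u -t 60 -c $ARG1 $ARG2"
echo "$expanded"
```

With the real host address in place of the placeholder, the printed line should match the command run by hand above.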
However, Nagios reports this check as "error code 255 out of bounds", and I'm not sure why.
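For context, plugin return codes 0 through 3 map to OK, WARNING, CRITICAL and UNKNOWN, and anything else is treated as out of bounds. A minimal sketch of that mapping follows; the function name classify_return_code is hypothetical and only illustrates the convention, not Nagios's own code.

```shell
#!/bin/sh
# Hypothetical helper: map a plugin exit status to the state name used by
# the plugin return-code convention (0-3); everything else is out of bounds.
classify_return_code() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "out of bounds" ;;
    esac
}

classify_return_code 0      # the status seen when running the check by hand
classify_return_code 255    # the status Nagios says it received
```

So the 0 seen on the command line and the 255 seen by Nagios cannot come from the same execution of the plugin.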
Running the check_disk command directly on the monitored server works fine:
$: ./check_disk -w 20% -c 10% -p /tmp
DISK OK - free space: /tmp 4785 MB (97% inode=99%);| /tmp=124MB;3928;4419;0;4910
$: echo $?
0
And as shown above, it works when done through the check_nrpe executable on the Nagios server. This means:
- The command (check_disk) is present on the remote system: command[check_disk]=/usr/local/nagios/libexec/check_disk $ARG1$
- The Nagios server is able to talk to the remote nrpe (e.g. it can access it on the network and its IP is present in the only_from directive in /etc/xinetd.d/nrpe)
Additionally, this check runs fine on some machines, but not on all of them.
Why does Nagios think it's getting a 255 when everything I can see means it should be getting 0 and thus marking the service as OK?
EDIT: The Nagios version is Nagios Core 4 running on CentOS 7. The hosts being checked run CentOS 5 through 7, and the problem appears on multiple machines of varying versions.