Nagios plugin to take process snapshot when load is high

Question

We have configured Nagios with check_load via the NRPE plugin to monitor server load. It reports when load is high, but it has no option to take a snapshot of the top processes (like the top command) at that moment.

Are there any Nagios NRPE plugins for that?

score 12 · Accepted Answer · edited Jul 23 '15 at 18:03

You can do it with event handlers.

First, add an event handler to your load average service definition:

define service{
    use                     generic-service
    host_name               xx
    service_description     Load_Average
    check_command           check_nrpe!check_load
    event_handler           processes_snapshot!xx
    contact_groups          admin-sms
}

The processes_snapshot command is defined in commands.cfg:

define command{
    command_name    processes_snapshot
    command_line    $USER1$/eventhandlers/processes_snapshot.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}

Second, write the event handler script (processes_snapshot.sh):

#!/bin/bash
# Arguments: $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$,
#            $3 = $SERVICEATTEMPT$, $4 = $HOSTADDRESS$

case "$1" in
    OK|UNKNOWN)
        # Nothing to do.
        ;;
    WARNING|CRITICAL)
        # Ask NRPE on the affected host to take the snapshot.
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c processes_snapshot
        ;;
esac

exit 0
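
The handler receives `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` as `$2` and `$3` but does not use them, so it runs on soft retries as well as on hard states. A minimal sketch of a filter that only triggers on hard WARNING/CRITICAL states follows; the function name `should_snapshot` is hypothetical, and skipping soft states is an assumption about what you want, not part of the original answer:

```shell
#!/bin/sh
# Hypothetical helper: decide whether the event handler should take a snapshot.
#   $1 = $SERVICESTATE$      (OK, WARNING, UNKNOWN, CRITICAL)
#   $2 = $SERVICESTATETYPE$  (SOFT or HARD)
should_snapshot() {
    case "$1" in
        WARNING|CRITICAL)
            # Assumption: only act once the problem is confirmed (HARD).
            [ "$2" = "HARD" ]
            ;;
        *)
            return 1
            ;;
    esac
}
```

In processes_snapshot.sh this could be used as `should_snapshot "$1" "$2" && /usr/local/nagios/libexec/check_nrpe -H "$4" -c processes_snapshot`. Waiting for a hard state means the snapshot is taken several check attempts after the load first rose, so dropping the `HARD` test may be preferable when spikes are short.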

The processes_snapshot command is defined in nrpe.cfg on the xx host as follows:

command[processes_snapshot]=top -cSbn 1 | tail -n +8 | sort -rn -k11 | head > /tmp/proc_snap.txt
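
The command above always writes to /tmp/proc_snap.txt, so each new snapshot overwrites the previous one. A minimal sketch of a helper that keeps one timestamped file per snapshot instead; the function name `save_snapshot` and the target directory are hypothetical, and the directory must be writable by the user NRPE runs as:

```shell
#!/bin/sh
# Hypothetical helper: store stdin in a timestamped file under directory $1
# and print the resulting path, so earlier snapshots are not overwritten.
save_snapshot() {
    dir="$1"
    mkdir -p "$dir" || return 1
    # Timestamp plus shell PID keeps file names distinct across runs.
    file="$dir/proc_snap.$(date +%Y%m%d-%H%M%S).$$.txt"
    cat > "$file" || return 1
    # One line of output, so the caller can see where the snapshot went.
    echo "snapshot saved to $file"
}
```

Wrapped in a small script that ends with `save_snapshot "$1"`, it could replace the redirect, for example `top -cSbn 1 | tail -n +8 | sort -rn -k11 | head | /path/to/script /var/tmp/proc_snaps` (both paths are placeholders).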

PS: I haven't tested this config.

edited Jul 23 '15 at 18:03 by sebix
answered Oct 05 '11 at 05:53 by quanta

This looks like it puts the snapshot in a file in /tmp. Is it possible to get the list in the notification email that nagios sends? – Marius Gedminas Dec 10 '12 at 11:51
There's a bug in the nrpe.cfg top command line: `-cSb n 1` should be `-cSbn 1`, otherwise you get `top: unknown argument 'n'`. – Marius Gedminas Dec 10 '12 at 11:56
@MariusGedminas: 1. How do you send email from Nagios? Can you pipe the `top` output to `mail -s 'process snapshotting' youremail@domain.com`? 2. My `top` versions on CentOS, Gentoo, and Ubuntu work fine. – quanta Dec 10 '12 at 14:45
I use whatever the Debian package sets up by default. I was hoping to see the process snapshot *in the same email* as the CRITICAL/WARNING notification; a separate one would be trivial to arrange, as you suggest. The version of top that was complaining about the space in front of 'n' was from procps 1:3.2.8-11ubuntu6 (Ubuntu 12.04 LTS); top from procps 1:3.3.3-2ubuntu3 (Ubuntu 12.10) doesn't complain. – Marius Gedminas Dec 11 '12 at 08:41
@MariusGedminas: there are a few ways to do this: 1. Write your own plugin and append the `top` output to the `$SERVICEOUTPUT$`. 2. Define an additional command that sends a mail containing the `top` output, and set it in the `service_notification_commands` of your contact. – quanta Dec 11 '12 at 09:19

score 8 · Answer 2 · answered Dec 27 '12 at 15:52

Here's what I did to get a process list snapshot directly in the notification emails, based on the idea by @quanta. It may contain paths specific to the way Nagios is installed on Debian/Ubuntu machines:

Created a wrapper script /usr/local/sbin/check_load that calls the original and appends the process snapshot if the exit code is non-zero, i.e. 1 (WARNING) or 2 (CRITICAL), and also 3 (UNKNOWN):
```
#!/bin/sh
/usr/lib/nagios/plugins/check_load "$@" || {
    rc=$?
    echo
    # http://nagios.sourceforge.net/docs/3_0/pluginapi.html
    # | separates long output from perfdata
    COLUMNS=1000 top -cSbn 1|sed -e 's/|/<BAR>/g' -e 's/ \+$//'
    exit $rc
}
```
This sets COLUMNS to a large number so the process names/command lines won't be truncated to 40 characters, runs top in batch mode for one iteration (-bn 1), asks for full command lines (-c) and cumulative CPU times (-S) to be shown, then makes sure top's output isn't cut off at the first | character by replacing each | with <BAR>.
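
To see the sed step in isolation, here is a minimal sketch. The function name `sanitize_long_output` is hypothetical, and it uses ` *$` instead of the GNU-specific ` \+$` so it should also work with non-GNU sed:

```shell
#!/bin/sh
# Hypothetical helper: make arbitrary text safe to use as Nagios long output.
sanitize_long_output() {
    # Replace every "|" (the perfdata separator) and strip trailing spaces.
    sed -e 's/|/<BAR>/g' -e 's/ *$//'
}
```

For example, `printf 'sh -c a | b   \n' | sanitize_long_output` prints `sh -c a <BAR> b`.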

I find top's default sort order to be adequate -- attempting to re-sort by cumulative CPU time, as suggested in @quanta's answer, puts system daemons like init or crond at the top, which doesn't help me figure out which CGI script was responsible for the CPU spike. Also, this way I get to keep top's header.

Don't forget to chmod +x /usr/local/sbin/check_load
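
Note that the `||` in the wrapper fires on any non-zero exit status, which includes 3 (UNKNOWN). If you want the snapshot strictly for WARNING and CRITICAL, a variant could look like the sketch below. The function name `with_snapshot` and the `SNAPSHOT_CMD` override are hypothetical and exist only to make the idea easy to try out:

```shell
#!/bin/sh
# Hypothetical wrapper: run the plugin given as arguments and append a
# snapshot only for WARNING (1) or CRITICAL (2), keeping the exit status.
# SNAPSHOT_CMD is an assumed override; it defaults to the top call above.
with_snapshot() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 1 ] || [ "$rc" -eq 2 ]; then
        echo
        # Same post-processing as the wrapper: "|" would start perfdata.
        COLUMNS=1000 ${SNAPSHOT_CMD:-top -cSbn 1} | sed -e 's/|/<BAR>/g' -e 's/ *$//'
    fi
    return "$rc"
}
```

Inside /usr/local/sbin/check_load this would be called as `with_snapshot /usr/lib/nagios/plugins/check_load "$@"` followed by `exit $?`.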

Edit /etc/nagios-plugins/config/load.cfg and replace the check_load entry

command_line    /usr/lib/nagios/plugins/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'

with

command_line    /usr/local/sbin/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'

Edit /etc/nagios3/commands.cfg and update the notify-service-by-email entry so it includes $LONGSERVICEOUTPUT$ in the generated emails. It's too long to paste here; basically find the Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail bit and change it to Info:\n\n$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$\n" | /usr/bin/mail.

Restart Nagios: service nagios3 restart.
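
The commands.cfg change can also be scripted. The following is only a sketch: the function name `add_long_output` is hypothetical, and it assumes the notify-service-by-email entry contains the literal text `$SERVICEOUTPUT$\n" | /usr/bin/mail` exactly as quoted above. It prints the modified file to standard output rather than editing in place, so the result can be reviewed first:

```shell
#!/bin/sh
# Hypothetical helper: print the contents of a commands.cfg-style file ($1)
# with $LONGSERVICEOUTPUT$ added after $SERVICEOUTPUT$ in the mail command.
# Assumes the entry contains the literal text  $SERVICEOUTPUT$\n" | /usr/bin/mail
add_long_output() {
    # "\$" matches a literal dollar sign, "\\n" a literal backslash + n.
    sed -e 's#\$SERVICEOUTPUT\$\\n" | /usr/bin/mail#$SERVICEOUTPUT$\\n$LONGSERVICEOUTPUT$\\n" | /usr/bin/mail#' "$1"
}
```

A possible use is `add_long_output /etc/nagios3/commands.cfg > /tmp/commands.cfg.new`, then comparing the two files with diff before replacing the original. Lines that already contain `$LONGSERVICEOUTPUT$` in that position are left unchanged.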

I haven't tried this with NRPE.

score 1 · Answer 3 · edited Jul 23 '15 at 18:07

I prefer:

command[processes_snapshot]=top -cSbn 1 | head -14 | tail -8
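
`head -14 | tail -8` keeps lines 7 to 14 of top's output. Assuming top's default batch layout, in which the column header is line 7 (consistent with the `tail -n +8` in the accepted answer), that is the header plus the first seven processes. A small sketch that shows the slicing on numbered lines; the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: keep lines 7 to 14 of whatever arrives on stdin.
header_and_first_processes() {
    head -n 14 | tail -n 8
}
```

Running `seq 1 20 | header_and_first_processes` prints the numbers 7 through 14, one per line.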

edited Jul 23 '15 at 18:07 by sebix
answered Sep 07 '13 at 19:49 by joshua paul
