
We have configured Nagios with check_load via the NRPE plugin to monitor server load. It reports when the load is high, but it has no option to take a snapshot of the top processes (like the top command) at that moment.

Are there any Nagios NRPE plugins for that?

alexus
nitins

3 Answers


You can do it with event handlers.

First, add an event handler to your load average service definition:

define service{
    use                     generic-service
    host_name               xx
    service_description     Load_Average
    check_command           check_nrpe!check_load
    event_handler           processes_snapshot!xx
    contact_groups          admin-sms
}

The processes_snapshot command is defined in commands.cfg:

define command{
    command_name    processes_snapshot
    command_line    $USER1$/eventhandlers/processes_snapshot.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}

Second, write the event handler script (processes_snapshot.sh):

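#!/bin/bash
# Event handler: ask the monitored host for a process snapshot via NRPE.
# $1 = service state, $4 = host address (see the command definition above).

case "$1" in
    WARNING|CRITICAL)
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c processes_snapshot
        ;;
    OK|UNKNOWN)
        # Nothing to capture in these states.
        ;;
esac

exit 0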
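
The command definition passes $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as the second and third arguments, but the handler only looks at the state. As far as I understand the Nagios documentation, event handlers also run on SOFT retries, so the snapshot may be taken several times per incident. A minimal sketch of a filter, assuming you want a snapshot on the first SOFT attempt and again when the state goes HARD (should_snapshot is a hypothetical helper name, and the policy itself is an assumption, not part of the answer):

```shell
#!/bin/sh
# Hypothetical helper: decide whether this handler invocation should
# trigger a snapshot. Arguments mirror the macros in the command definition:
#   $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$, $3 = $SERVICEATTEMPT$
should_snapshot() {
    # Only problem states are interesting.
    case "$1" in
        WARNING|CRITICAL) ;;
        *) return 1 ;;
    esac
    # Assumed policy: first SOFT attempt (closest to the spike) and HARD.
    case "$2" in
        HARD) return 0 ;;
        SOFT) [ "$3" -eq 1 ] ;;
        *)    return 1 ;;
    esac
}

# Possible use inside the handler:
#   if should_snapshot "$1" "$2" "$3"; then
#       /usr/local/nagios/libexec/check_nrpe -H "$4" -c processes_snapshot
#   fi
```

I haven't run this against a live Nagios instance either, so treat the policy as a starting point.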

The command processes_snapshot is defined in nrpe.cfg on the xx host as follows:

command[processes_snapshot]=top -cSbn 1 | tail -n +8 | sort -rn -k11 | head > /tmp/proc_snap.txt

PS: I haven't tested this config.
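
Note that the command above overwrites /tmp/proc_snap.txt on every run, so only the latest snapshot survives. A rough sketch of a small script that writes timestamped files instead, which nrpe.cfg could point at in place of the inline pipeline. The names take_snapshot, SNAP_DIR and SNAP_CMD are made up for illustration, and keeping the first 20 lines is an arbitrary choice:

```shell
#!/bin/sh
# Hypothetical snapshot helper. SNAP_DIR and SNAP_CMD are illustrative
# names; override them from the environment if needed.
SNAP_DIR="${SNAP_DIR:-/tmp}"
SNAP_CMD="${SNAP_CMD:-top -cSbn 1}"

take_snapshot() {
    # One file per invocation, named after the current time.
    out="$SNAP_DIR/proc_snap.$(date +%Y%m%d-%H%M%S).txt"
    # SNAP_CMD is left unquoted on purpose so its options are split.
    $SNAP_CMD | head -n 20 > "$out"
    # Print the path so the caller gets one line of output.
    echo "$out"
}

# When saved as a standalone script, call take_snapshot on the last line.
```

Old snapshots would then have to be cleaned up separately, for example from cron.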

sebix
quanta
  • This looks like it puts the snapshot in a file in /tmp. Is it possible to get the list in the notification email that nagios sends? – Marius Gedminas Dec 10 '12 at 11:51
  • There's a bug in the nrpe.cfg top command line: `-cSb n 1` should be `-cSbn 1`, otherwise you get `top: unknown argument 'n'`. – Marius Gedminas Dec 10 '12 at 11:56
  • @MariusGedminas: 1. How do you send email from Nagios? Can you pipe the `top` output to `mail -s 'process snapshotting' youremail@domain.com`? 2. My `top` version works fine on CentOS, Gentoo, and Ubuntu. – quanta Dec 10 '12 at 14:45
  • I use whatever the Debian package sets up by default. I was hoping to see the process snapshot *in the same email* as the CRITICAL/WARNING notification; a separate one would be trivial to arrange, as you suggest. The version of top that was complaining about the space in front of 'n' was from procps 1:3.2.8-11ubuntu6 (Ubuntu 12.04 LTS); top from procps 1:3.3.3-2ubuntu3 (Ubuntu 12.10) doesn't complain. – Marius Gedminas Dec 11 '12 at 08:41
  • @MariusGedminas: there are a couple of ways to do this: 1. Write your own plugin and append the `top` output to the `$SERVICEOUTPUT$`. 2. Define an additional command that sends a mail containing the `top` output and set it in the `service_notification_commands` of your contact. – quanta Dec 11 '12 at 09:19

Here's what I did to get a process list snapshot directly in the notification emails, based on the idea by @quanta. It may contain paths specific to the way Nagios is installed on Debian/Ubuntu machines:

  1. Created a wrapper script /usr/local/sbin/check_load that calls the original and appends the process snapshot if the exit code is nonzero, i.e. 1 (WARNING) or 2 (CRITICAL), and also 3 (UNKNOWN):

    #!/bin/sh
    /usr/lib/nagios/plugins/check_load "$@" || {
        rc=$?
        echo
        # http://nagios.sourceforge.net/docs/3_0/pluginapi.html
        # | separates long output from perfdata
        COLUMNS=1000 top -cSbn 1|sed -e 's/|/<BAR>/g' -e 's/ \+$//'
        exit $rc
    }
    

    This sets COLUMNS to a large number so the process names/command lines won't be truncated to 40 characters, runs top in batch mode for one iteration (-bn 1), and asks for full command lines (-c) and cumulative CPU times (-S) to be shown. It then replaces every | with <BAR>, because Nagios treats everything after a | as perfdata and would otherwise cut top's output at the first | character.

    I find top's default sort order adequate. Re-sorting by cumulative CPU time, as suggested in @quanta's answer, puts system daemons like init or crond at the top, which doesn't help me figure out which CGI script was responsible for the CPU spike. This way I also get to keep top's header.

    Don't forget to chmod +x /usr/local/sbin/check_load

  2. Edit /etc/nagios-plugins/config/load.cfg and replace the check_load entry

    command_line    /usr/lib/nagios/plugins/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'
    

    with

    command_line    /usr/local/sbin/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'
    
  3. Edit /etc/nagios3/commands.cfg and update the notify-service-by-email entry so it includes $LONGSERVICEOUTPUT$ in the generated emails. It's too long to paste here; basically find the Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail bit and change it to Info:\n\n$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$\n" | /usr/bin/mail.

  4. Restart nagios: service nagios3 restart.

I haven't tried this with NRPE.
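
For anyone who wants to try the wrapper idea without touching the real check, here is the same pattern from step 1 as a function that can be exercised with a stub. This is only a sketch: run_with_snapshot, sanitize and SNAPSHOT_CMD are names invented for the example, and the sed expression uses `  *$` instead of ` \+$` so it does not rely on a GNU extension.

```shell
#!/bin/sh
# Sketch of the wrapper pattern: run a check and, on any nonzero exit,
# append long output while preserving the check's exit status.
# All names here are hypothetical.
SNAPSHOT_CMD="${SNAPSHOT_CMD:-top -cSbn 1}"

sanitize() {
    # Replace | so Nagios does not read the rest as perfdata,
    # and strip trailing spaces.
    sed -e 's/|/<BAR>/g' -e 's/  *$//'
}

run_with_snapshot() {
    # If the check succeeds there is nothing to add.
    "$@" && return 0
    rc=$?
    # Blank line between the plugin's first line and the long output.
    echo
    COLUMNS=1000 $SNAPSHOT_CMD | sanitize
    return $rc
}

# Example with a stub check that reports WARNING:
#   run_with_snapshot sh -c 'echo "WARNING - load average: 9.1"; exit 1'
```

The stub call prints the WARNING line, a blank line and the sanitized snapshot, and the function returns 1, which is what Nagios needs to see as the plugin's exit status.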

Marius Gedminas

I prefer:

command[processes_snapshot]=top -cSbn 1 | head -14 | tail -8
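
To make it clearer what that pipeline keeps: `head -14 | tail -8` selects lines 7 through 14 of the input. Assuming the usual procps batch layout (summary lines, a blank line, then the column header on line 7), that is the header plus the first seven processes. A quick way to check the window, using numbered lines as a stand-in for top's output (window is just an illustrative function name):

```shell
#!/bin/sh
# Same line window as in the NRPE command, applied to stdin.
window() {
    head -n 14 | tail -n 8
}

# Numbered lines stand in for top's output; this prints 7 through 14.
seq 1 20 | window
```

If your top prints a different number of summary lines, the numbers need adjusting.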
sebix
joshua paul