I've set up a Proxmox VE cluster with three nodes. Each node has a number of VMs running on it. I'm using the PVE Monitor Plugin to set up the hosts and services, which works fine.

My issue is that Nagios's email-sending behavior is somewhat odd. Ideally, I would like a check once per minute, both for the nodes and for all services running on each node.

My configuration file looks like this:

# Define the cluster itself as a host.
# The command check_pve_cluster_nodes gives us info
# on the state of the cluster members.
define host {
        host_name pve-cluster
        max_check_attempts 10
        check_command check_pve_cluster_nodes
        check_interval 1
        contact_groups admins
        notifications_enabled 1
}

# Define OpenVZ, Qemu and storages as services of the cluster
define service {
        use generic-service
        host_name pve-cluster
        service_description OpenVZ VMs
        check_command check_pve_cluster_openvz
        check_interval 1
        contact_groups admins
        notifications_enabled 1
}

define service {
        use generic-service
        host_name pve-cluster
        service_description Qemu VMs
        check_command check_pve_cluster_qemu
        check_interval 1
        contact_groups admins
        notifications_enabled 1
}

define service {
        use generic-service
        host_name pve-cluster
        service_description Storages
        check_command check_pve_cluster_storage
        check_interval 1
        contact_groups admins
        notifications_enabled 1
}

I haven't changed the time unit settings, so those should be once-per-minute checks. The Nagios web UI shows that a host is offline, but email notifications are sent only a couple of minutes later. Furthermore, the email content is missing the most important piece of information: exactly which node or service is in a critical state.

Node down

***** Nagios *****

Notification Type: PROBLEM
Host: pve-cluster
State: DOWN
Address: pve-cluster
Info: NODES CRITICAL  2 / 3 working nodes

Date/Time: Fri Mar 6 10:48:25 CET 2015

VM down

***** Nagios *****

Notification Type: PROBLEM

Service: Qemu VMs
Host: pve-cluster
Address: pve-cluster
State: CRITICAL

Date/Time: Fri Mar 6 10:40:44 CET 2015

Additional Info:

QEMU CRITICAL 2 / 3 working VMs

How can I set up the configuration so that hosts and services (i.e. VMs) are checked at one-minute intervals? Ideally, notifications for a problem that persists should then be re-sent at 15-minute intervals.

Is this even the best workflow? Or is there another, better way to schedule notifications and acknowledge them?

doque

1 Answer

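Nagios only sends emails once a host or service has entered a 'hard' state. To answer your question at a basic level: a hard state is reached once the host or service has been checked the number of times specified by max_check_attempts and is still failing. By default, this is 4. Your host definition sets max_check_attempts to 10, so the cluster host has to fail ten checks in a row before a notification is sent, which would explain the delay you are seeing.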

Info on soft/hard states: http://nagios.sourceforge.net/docs/3_0/statetypes.html

Info on max_check_attempts: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
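
As a rough, untested sketch of how the timing directives fit together for the goal in the question: the values below are illustrative assumptions, and only the timing-related directives are changed from the existing definitions. It also assumes interval_length is still at its default of 60 seconds, so each interval unit is one minute. With these values a problem should turn hard roughly two minutes after the first failed check, and the notification should be repeated every 15 minutes until the problem recovers or is acknowledged.

```cfg
define host {
        host_name               pve-cluster
        check_command           check_pve_cluster_nodes
        check_interval          1       ; regular check every minute
        retry_interval          1       ; re-check every minute while in a soft state
        max_check_attempts      3       ; illustrative value: hard state after 3 failed checks
        notification_interval   15      ; re-notify every 15 minutes while the problem persists
        contact_groups          admins
        notifications_enabled   1
}

define service {
        use                     generic-service
        host_name               pve-cluster
        service_description     Qemu VMs
        check_command           check_pve_cluster_qemu
        check_interval          1
        retry_interval          1
        max_check_attempts      3       ; illustrative value, overrides the template
        notification_interval   15
        contact_groups          admins
        notifications_enabled   1
}
```

The same three directives (retry_interval, max_check_attempts, notification_interval) would apply to the OpenVZ and Storages services. Lowering max_check_attempts makes notifications faster but also makes a single flaky check more likely to trigger an email, so the value is a trade-off.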

As for the missing details in the emails: the plugin clearly intends to return them, but for whatever reason it isn't doing so. Unfortunately I don't have an environment to test this in, so I may have to leave that part of the question unanswered.

Relevant sections of the plugin's Perl source:

print "NODES $rstatus{$statusScore}  $workingNodes / " .
          scalar(@monitoredNodes) . " working nodes" . $br . $reportSummary;

print "STORAGE $rstatus{$statusScore} $workingStorages / " .
          scalar(@monitoredStorages) . " working storages" . $br . $reportSummary;

print "OPENVZ $rstatus{$statusScore} $workingVms / " .
          scalar(@monitoredOpenvz) . " working VMs" . $br . $reportSummary;

print "QEMU $rstatus{$statusScore} $workingVms / " .
          scalar(@monitoredQemus) . " working VMs" . $br .
          $reportSummary;

$reportSummary is populated higher up in the code with details of the problem items, but it doesn't seem to make it into the output you receive.
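
One possible explanation, untested: Nagios splits plugin output at the first newline. The first line is available as $HOSTOUTPUT$ / $SERVICEOUTPUT$, and any further lines as $LONGHOSTOUTPUT$ / $LONGSERVICEOUTPUT$. If $br is a newline in the plugin (an assumption, I haven't checked how it is set), $reportSummary would end up in the long output, and the emails quoted in the question look like they only print the first line. A sketch of a service notification command that also includes the long output follows; the command name is hypothetical and the path to the mail binary is an assumption, so adapt both to your commands.cfg.

```cfg
# Hypothetical command name; modeled on the email layout shown in the question.
# The mail binary path is an assumption and differs between distributions.
define command {
        command_name    notify-service-by-email-long
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
```

To try it, the contact that receives the emails would need its service_notification_commands directive pointed at this command. The host notification could be extended the same way with $LONGHOSTOUTPUT$.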

Taz