nagios wrongly reports packet loss

Question

Lately my Nagios 3.2.3 install (CentOS 5, monitoring ~300 hosts, 1150 services) has started to occasionally report high packet loss on 50-60 hosts at a time. The problem is that it's bogus. Manual runs of ping (or its own check_ping binary) find no fault with any of the affected hosts. The only possible cures I have found so far are:

run all the checks manually (they will succeed but it may act up again on next check)
acknowledge and wait for the problem to go away (may take several hours)

I suspect (but have no particular reason other than single rescheduled checks succeeding) that the problem may lie with all the checks being mass scheduled together, in which case introducing some jitter in the scheduling (how?) might help. Or it may be something completely different.

Ideas, anyone?

Edit:

For people interested in constructive debate (rather than point scoring): I am not trying to measure packet loss. Network performance is not my concern in this instance, and if it were, it would be investigated with the proper tools for the job. NAGIOS (for the unwary) is mostly used to check the upness of hosts and services and to generate alerts. It is therefore highly annoying when it starts generating large numbers of fishy alerts. I am 99.9% positive that the problem is due to either:

1) some Nagios/Nagios-Plugin snag
2) some system (memory, CPU, I/O, network stack) problem

possibly caused by the burst of requests sent by the nagios scheduler. The packet losses are all above 50% - if they were real, our phones would be melting. So far I have no evidence for (2), so I am looking for "prior art" in (1). I may well be mistaken in my belief, but, if I have to reach for wireshark or similar, a suggestion on what to look for would be greatly appreciated.

The fact that there's packet loss at time A, but not when checked again at time B, doesn't mean the first result was bogus. I'd be inclined to start by assuming that NAGIOS was telling the truth, and investigate why I was getting intermittent packet loss. — MadHatter, Nov 02 '12 at 09:18
Besides manual checks, I have other independent checks (smokeping, cacti) telling me that it's not the case. The affected hosts are on different remote networks (and different owners) yet other hosts on the same networks do not have the same loss. Several of the hosts are running loss sensitive services (VPNS, mostly) which would drop with the reported loss rates - they don't. Everything happens in lockstep. I could go on, but the bottom line is that it is highly unlikely that Nagios is telling the truth. — Alien Life Form, Nov 02 '12 at 09:26
A reasonable answer. Is there any time correlation in the affected hosts? That is, if all the checks run between (say) 1000 and 1002 gave losses, but those between 1002 and 1004 didn't, the fact that hosts in both groups were on the same networks wouldn't signify. The point about other services being available definitely doesn't signify, since tests for connectivity using different transport media (eg, TCP over a VPN) have different timeouts. What do you see **on the wire** when the losses are occurring? — MadHatter, Nov 02 '12 at 09:38

score 1 · Answer 1 · answered Nov 02 '12 at 09:44

After you have verified the packet loss with other tools, first find out which plugin is actually checking for packet loss; usually it is check_ping. Locate that plugin, run it manually at the interval defined in Nagios, and check whether its output gives you a clue. The problem doesn't seem to be that there is real packet loss, but that the plugin is at fault. Once you have the plugin's output, compare it with the output of the other tools to see whether the plugin reports packet loss that the others don't.
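To make that comparison less manual, here is a minimal Python sketch that runs check_ping and pulls the loss and RTA figures out of its status line. It assumes the usual "Packet loss = N%, RTA = X ms" wording in the plugin output; the plugin path and the helper names (`parse_check_ping`, `run_check_ping`) are my own assumptions, not something from the question.

```python
import re
import subprocess

# Assumed plugin location; adjust to wherever your distribution installs it.
CHECK_PING = "/usr/lib64/nagios/plugins/check_ping"

# Matches the human-readable part of the status line, for example
# "PING OK - Packet loss = 0%, RTA = 0.42 ms|rta=...".
# RTA is optional because it may be missing when every packet is lost.
STATUS_RE = re.compile(r"Packet loss = (\d+)%(?:, RTA = ([\d.]+) ms)?")


def parse_check_ping(line):
    """Return (loss_percent, rta_ms or None) from one check_ping status line."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised check_ping output: {line!r}")
    loss = int(match.group(1))
    rta = float(match.group(2)) if match.group(2) else None
    return loss, rta


def run_check_ping(host, warn="100.0,20%", crit="500.0,60%", packets=5):
    """Run the plugin once (thresholds here are placeholders) and parse the result."""
    result = subprocess.run(
        [CHECK_PING, "-H", host, "-w", warn, "-c", crit, "-p", str(packets)],
        capture_output=True,
        text=True,
    )
    return parse_check_ping(result.stdout)
```

Calling `run_check_ping("some-host")` in a loop at the Nagios check interval and logging the tuples next to the smokeping or cacti figures should show whether the plugin itself ever sees the loss when run outside the scheduler.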

lgfischer · Answer 2 · 2014-06-18T14:34:15.790

I had a similar problem on my first try with Nagios. While trying to solve it, I found this blog post, which states that the problem may occur if you try to ping an IPv6 server from a server that has no IPv6 address.

So the solution is to rewrite the "check_ping" command in your Nagios object configuration files. In one of our .cfg files, I added the following:

define command {
    command_name    check_ping_ipv4
    command_line    $USER1$/check_ping -4 -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}

Please note the "-4" parameter after the check_ping command. It forces the ping to use IPv4 only. After defining the command above, I could use it in a service definition. For example:

define service {
    service_description     PING
    host_name               MYHOST
    check_command           check_ping_ipv4!100.0,20%!500.0,60%
    use                     generic-service
}
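If you want to check whether this applies to you before changing the command, a rough Python sketch like the following lists which address families a monitored name resolves to from the Nagios server. The function names are hypothetical, and it only matters when the host address in Nagios is a name rather than a literal IP address.

```python
import ipaddress
import socket


def split_by_family(addresses):
    """Split address strings into (ipv4, ipv6) lists, keeping input order."""
    v4, v6 = [], []
    for text in addresses:
        # Drop an IPv6 zone suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(text.split("%")[0])
        (v6 if ip.version == 6 else v4).append(text)
    return v4, v6


def resolve(host):
    """Return every address the local resolver offers for host."""
    infos = socket.getaddrinfo(host, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address as a string.
    return sorted({info[4][0] for info in infos})
```

Running `split_by_family(resolve("MYHOST"))` on the monitoring server and getting a non-empty IPv6 list, while the server itself has no usable IPv6 address, would be a hint that forcing "-4" is worth trying.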

I believe that the problem described in that blog is something else altogether. Using `check_http` as a plugin for a `check_ping` command also strikes me as at least slightly evil ;) — Felix Frank, Jun 17 '14 at 14:56
Sorry, I copied the wrong piece of code from my Nagios scripts. I've updated my answer ;) — lgfischer, Jun 18 '14 at 14:35

score -3 · Answer 3 · answered Nov 02 '12 at 11:05

Manual runs of ping (or its own check_ping binary) find no fault with any of the affected hosts

That's a really dumb way to check for packet loss. You should be comparing the retransmit counters at intervals (netstat -s) or capturing the traffic using a tool like pastmon or wireshark. Since:

1) you've already said that the packet loss occurs in bursts - how do you know you were running a ping on a path during the time packet loss was occurring?

2) small amounts of packet loss can have a big impact on throughput - which is why we monitor them - if you want to confirm packet loss of 1%, then you'd need to send at least 200 packets across the path - how many did you send?

3) However the overriding WTF here is that TCP, and to a lesser extent UDP, behave very differently from ICMP - the latter is far less affected by congestion issues (even assuming a consistent 1500 MTU)

i.e. you've provided no valid evidence that the packet loss is bogus. You have however provided evidence that you don't really understand what you were trying to measure.
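The sample-size point in 2) can be put into numbers with a small Python sketch. It assumes each probe is lost independently with the same probability, which is a simplification (real loss tends to come in bursts); `prob_at_least` is just an illustrative helper name.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of losing at least k of n probes when each is lost with probability p.

    Assumes independent losses (a plain binomial model).
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Chance that a true 1% loss shows up at all in 200 probes versus 5 probes.
    print(prob_at_least(1, 200, 0.01))
    print(prob_at_least(1, 5, 0.01))
    # Chance that a true 1% loss looks like 60% or worse in a 5-packet check.
    print(prob_at_least(3, 5, 0.01))
```

Under that model, 200 probes catch a 1% loss roughly 87% of the time, while a 5-packet check catches it only about 5% of the time. The same model also says a 5-packet check almost never reports 60% loss when the true rate is 1%.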

yet other hosts on the same networks do not have the same loss

Do you think packet loss only occurs between hosts? This is way wrong.

I would +1 this but I don't like the snarky tone at the end. Help the asker understand if you don't think they understand, don't berate them. We all have to start somewhere. — dunxd, Nov 02 '12 at 11:08
Are you sure *YOU* understand what I am trying to measure? How do you know I am concerned about packet loss? (I'm not) Do you know what nagios is and what it is used for? (Hint: not for measuring network performance) How do you know that packet loss is in the order of 1% (It is around 90% when reported) Did you even bother RTFQ (Reading The Fine Question)? If you did, did you stop to think (assuming that activity applies in this case) before deciding that all involved facts and parties are dumb? Have a nice day. — Alien Life Form, Nov 02 '12 at 11:20
