collectd: ping plugin fails randomly with a medium/high number of hosts

Question

I am trying to use collectd to monitor ping time and interface traffic for upwards of 150 hosts, using the snmp and ping plugins (the nodes are mostly routers). The server reads the stats (ping/snmp) and writes them to disk via the rrdtool plugin. All is fine with a few hosts; however, when I configure a hundred of them, many of the graphs, especially the ping time ones, become fragmented, showing only a fraction of the expected values or nothing at all. The logs (at debug level) show many errors like:

rrdtool plugin: rrd_update_r (...) failed: ...  illegal attempt to update using time 1393957157 when last update time is 1393957286 (minimum one second step)
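
To make the size of the gap in that message concrete, the two timestamps can simply be subtracted. This is only a small sketch; the variable names are illustrative and do not come from collectd or RRDtool.

```python
from datetime import datetime, timezone

# Timestamps copied from the rrd_update_r error above (Unix epoch seconds).
rejected_update = 1393957157  # time of the value collectd tried to write
last_update = 1393957286      # time of the last update already in the RRD file

# How far behind the last stored update the rejected value is.
lag_seconds = last_update - rejected_update
print(f"rejected value is {lag_seconds} s older than the last stored update")

# Same timestamps in human-readable UTC form, for matching against other logs.
for label, ts in (("rejected", rejected_update), ("last", last_update)):
    print(label, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```

The difference is 129 seconds, so the rejected value is not off by a rounding error of a second or so; it is more than two minutes older than what the file already holds.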

The same sites ping fine from the CLI and do report some SNMP data (though not all, and not reliably).

The FAQ on collectd's site mentions client/server time differences or the same plugin being loaded multiple times as possible causes; both are ruled out in this case. I am running collectd 5.4.1 on CentOS 6. I have tried raising the number of read/write threads, with no joy.

EDIT: I have since tried activating the write_graphite plugin, and I get exactly the same faulty graphs in both rrd and graphite. So the problem appears to lie specifically with the ping plugin (and not, say, with disk I/O or the write backend).

EDIT2: The failing hosts have (mostly) NaNs added to the rrd/graphite/csv files.

EDIT3: After much trial and error I found that failures begin when trying to ping upwards of 59 hosts, at which point the collectd process has about 63 sockets open. So it seems something has a problem with more than that number of sockets. It does not appear to be a hard limit, however, because with 116 hosts configured in the plugin I can see collectd opening 118 sockets. So it could be a per-thread thing or something within liboping (1.5.1).
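
For anyone wanting to reproduce the socket count, one way is to inspect `/proc/<pid>/fd` on Linux. The following is a minimal sketch under that assumption; `count_open_sockets` is a hypothetical helper name, and reading another user's process normally requires root.

```python
import os

def count_open_sockets(pid):
    """Count the file descriptors of `pid` that are sockets (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # Socket descriptors show up as links of the form "socket:[inode]".
        if target.startswith("socket:"):
            count += 1
    return count

# Demonstration on the current process; replace os.getpid() with the
# collectd PID (for example the output of `pidof collectd`) to check collectd.
print(count_open_sockets(os.getpid()))
```

Running it repeatedly while adding hosts to the ping plugin configuration would show whether the socket count keeps tracking the number of hosts past the point where the graphs start to break.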

We have the same problem here with that plugin, as we need to monitor more than 600 servers. Have you found any clue? — Hanynowsky, Oct 02 '14 at 10:12
Nope. I asked on the collectd dev list and have been ignored; that appears to be the norm on that list, every question I've ever asked was... maybe it's just me. I turned to smokeping for my graphs. — Alien Life Form, Oct 07 '14 at 08:50
We're dumping the native ping plugin and writing a custom one with the fping package which looks promising. — Hanynowsky, Oct 08 '14 at 23:45
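
The custom plugin mentioned in the last comment was not posted, so the following is not that plugin. It is only a rough sketch of what an fping-based replacement could look like when run through collectd's exec plugin. It assumes that `fping -C <count> -q` prints one `host : rtt rtt ...` line per target on stderr (with `-` for a lost probe) and that values are handed to collectd as `PUTVAL` lines on stdout. The function names, the `mon` host name and the 10 second interval are all made up for illustration.

```python
import subprocess

def run_fping(hosts, count=3):
    """Run `fping -C <count> -q`; the per-host summary is printed on stderr."""
    proc = subprocess.run(
        ["fping", "-C", str(count), "-q", *hosts],
        capture_output=True, text=True, check=False,
    )
    return proc.stderr

def parse_fping(output):
    """Map each host to its list of RTTs in ms (None for a lost probe)."""
    results = {}
    for line in output.splitlines():
        host, sep, values = line.partition(" : ")
        if not sep:
            continue  # not a "host : v1 v2 ..." line
        try:
            rtts = [None if v == "-" else float(v) for v in values.split()]
        except ValueError:
            continue  # some other message that happens to contain " : "
        results[host.strip()] = rtts
    return results

def putval_lines(results, collectd_host, interval=10):
    """Format the mean RTT per target as collectd plain-text PUTVAL lines."""
    lines = []
    for target, rtts in results.items():
        answered = [r for r in rtts if r is not None]
        # Assumption: "U" marks an undefined value when every probe was lost.
        value = f"{sum(answered) / len(answered):.2f}" if answered else "U"
        lines.append(
            f'PUTVAL "{collectd_host}/ping/ping-{target}" '
            f"interval={interval} N:{value}"
        )
    return lines

# Illustrative input only, not captured from a real run.
sample = "10.0.0.1 : 1.00 3.00 -\n10.0.0.2 : - - -\n"
for line in putval_lines(parse_fping(sample), "mon"):
    print(line)
```

In a real script the sample string would be replaced by `run_fping(hosts)` inside a loop that sleeps for the configured interval. Since fping probes all targets from a single process, the number of sockets should no longer grow with the number of hosts, though that is untested here.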

