I am trying to use collectd to monitor ping time and interface traffic on upwards of 150 hosts (mostly routers), using the snmp and ping plugins. The server reads the stats (ping/snmp) and writes them to disk via the rrdtool plugin. All is fine with a few hosts. However, when I put in a hundred of them, many of the graphs, especially the ping time ones, become fragmented, showing only a fraction of the expected values or nothing at all. The logs (at debug level) show oodles of errors like:
rrdtool plugin: rrd_update_r (...) failed: ... illegal attempt to update using time 1393957157 when last update time is 1393957286 (minimum one second step)
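To put a number on how stale the rejected updates are, here is a minimal Python sketch that pulls the two timestamps out of a log line like the one above. The function name `update_lag` and the regular expression are my own illustration and not part of collectd or rrdtool; the pattern only assumes the message wording shown in the log.

```python
import re

# Matches the two epoch timestamps in the rrdtool error message.
# Assumption: the wording is exactly as in the log line quoted above.
PATTERN = re.compile(r"update using time (\d+) when last update time is (\d+)")


def update_lag(line):
    """Return how many seconds the rejected update is behind the last
    accepted one, or None if the line is not this kind of error."""
    match = PATTERN.search(line)
    if match is None:
        return None
    attempted, last = int(match.group(1)), int(match.group(2))
    return last - attempted


sample = (
    "rrdtool plugin: rrd_update_r (...) failed: ... illegal attempt to update "
    "using time 1393957157 when last update time is 1393957286 "
    "(minimum one second step)"
)
print(update_lag(sample))  # 129
```

For the line quoted above, the rejected value is 129 seconds older than the last one written to the RRD file.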
The same sites ping all right from the CLI and do report some snmp data (though not all, and not reliably).
The FAQs on collectd's site mention client/server time differences or multiple plugins being loaded; both are ruled out in this case. I am running collectd 5.4.1 on CentOS 6. I have tried jacking up the read/write threads, with no joy.
EDIT: I have since tried activating the write_graphite plugin, and I get exactly the same faulty graphs in both rrd and graphite. So the problem appears to lie specifically with the ping plugin (and not, say, with disk I/O or the write backend).
EDIT2: The failing hosts have (mostly) NaNs added to the rrd/graphite/csv files.
EDIT3: After much trial and error I found that failures begin when trying to ping upwards of 59 hosts, at which point the collectd process has about 63 sockets open. So it seems something could have a problem with more than that number of sockets. It does not appear to be a hard limit, however, because when I configure 116 hosts in the plugin, I can see collectd opening 118 sockets. So it could be a per-thread thing or something within liboping (1.5.1).
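For anyone who wants to reproduce the socket counts, one way to get them on Linux is to look at the process's file descriptors under /proc. This is a rough sketch rather than anything collectd provides: the function name `count_sockets` and the `proc_root` parameter are made up for illustration, and it assumes the usual Linux layout where each entry in `/proc/<pid>/fd` is a symlink and sockets show up as `socket:[inode]`.

```python
import os


def count_sockets(pid, proc_root="/proc"):
    """Count the file descriptors of process `pid` that are sockets.

    Assumes the Linux /proc layout: every entry in <proc_root>/<pid>/fd
    is a symlink, and socket descriptors point at "socket:[inode]".
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            count += 1
    return count
```

Calling `count_sockets(<pid of collectd>)` as root (or as the user running collectd) while adding hosts to the ping plugin should show the count moving with the number of configured hosts.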