
I've recently installed netdata on an Amazon EC2 Debian instance that I have. Netdata is pretty cool: nice charts/graphs, and painlessly easy to install (compared to others).

A number of times each day I receive a message such as:

1m ipv4 udp receive buffer errors = 9 errors
number of UDP receive buffer errors during the last minute
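For what it's worth, the same counter netdata is alarming on can be read straight from the kernel with standard tools; field names vary slightly between net-tools versions, so treat this as a rough sketch:

# netstat -su                  # look for "receive buffer errors" under the Udp: section
# grep '^Udp' /proc/net/snmp   # the RcvbufErrors column is the same counter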

and a few minutes later, a recovery message. There are probably hundreds of errors indicated with UDP/TCP throughout the day. I also see a similar pattern on a server I have running at home.

I've used other monitoring packages over the years and have never seen errors of this type. I suspect some level of errors, especially on UDP, is normal. Is that right? Is this expected behavior? Can I turn off the monitoring of these alarms?

I've moved to a second NIC on the machine at home with no essential change in behavior.

This question, Acceptable number of ethernet errors in a medium sized environment?, suggests that I might have a serious problem, and I can certainly try other NICs at home. But how would I solve this on my EC2 instance?

It may also be worth noting that logwatch reports no problems at all, but then, it may not be configured for this.

Thanks for any guidance.


1 Answer


netdata uses statsd as its metrics collection system. This is a UDP-based protocol which is incredibly fast and efficient, but at high rates it can overflow the receive buffer of the ingress node. The default receive buffer is around 1M, so if the statsd agent isn't able to consume datagrams quickly enough to keep the buffer from filling up, the kernel will drop them.
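You can check what your kernel currently allows before changing anything; these sysctls are standard on Linux, and the values reported will be whatever your distribution defaults to:

# sysctl net.core.rmem_default net.core.rmem_max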

The simple solution is to increase your receive buffer to a larger size to handle spikes; this usually solves UDP buffer overruns. If you still consistently see the alert above, you will need to increase the CPU capacity of the machine or move to a more performant statsd implementation (we had to move from the standard Node.js-based statsd daemon to a C++-based one).

To increase the buffer sizes, use the following commands:

# echo "net.core.rmem_default=8388608" >> /etc/sysctl.conf
# echo "net.core.rmem_max=16777216" >> /etc/sysctl.conf
# sysctl -p

The above values are quite aggressive and will increase the memory usage of the kernel network stack. You may want to start at smaller values and increase from there; the traditional ratio is max = default * 2.
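If you'd rather experiment before persisting anything, sysctl -w applies a value immediately without touching /etc/sysctl.conf (it won't survive a reboot). The numbers below are just example starting points at half the values above, not a recommendation:

# sysctl -w net.core.rmem_default=4194304
# sysctl -w net.core.rmem_max=8388608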

More information is available here: https://www.ibm.com/support/knowledgecenter/en/SSQPD3_2.6.0/com.ibm.wllm.doc/UDPSocketBuffers.html

Brennen Smith
  • @Brennan Smith: many thanks for this. I've tweaked things over the last day or so and finally have values that seem to work. I've set a max buffer of 750K and that seems to be the right number. – bo gusman Mar 02 '18 at 18:42