
I have a server instance on OpenStack that, at a pretty high load, starts losing UDP packets. I captured all outgoing packets using tcpdump and some of them are missing, even though application logs imply they should have been sent. The usual packet size is around 60-120 bytes.
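
For reference, the capture was along these lines (the interface name and host address are illustrative, not the real ones):

    tcpdump -i eth1 -n -w outgoing.pcap udp and src host 10.0.0.5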

Running netstat -s gives:

[root@myServer] ~> netstat -s | grep Udp: -A 5
Udp:
    3855490640 packets received
    133199 packets to unknown port received.
    89 packet receive errors
    4116940753 packets sent
    SndbufErrors: 1396176

When the server is under load, SndbufErrors keeps increasing. I have tried to figure out what might be causing it, but with no luck, even though it feels like this should be covered somewhere.
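
As far as I understand, SndbufErrors increments when a send fails with ENOBUFS, i.e. the socket's send buffer is full, typically because the queue below it cannot drain fast enough. A quick way to watch the counter and to inspect per-socket buffer usage (assuming iproute2's ss is available; these options are one workable combination, not the only one):

    watch -d 'netstat -s | grep Udp: -A 5'
    ss -u -a -m -p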

Q: What might be the reasons for this, and how can I resolve it?

Investigation I've done:

  1. Running ifconfig -a doesn't show any errors:

      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:6361554048 errors:0 dropped:0 overruns:0 frame:0
      TX packets:6902945025 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
    

    I tried increasing txqueuelen to 10,000 (by running ifconfig eth1 txqueuelen 10000), but it didn't make a difference.
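
    The same change via iproute2, in case ifconfig is deprecated on your distro (assuming iproute2 is installed):

        ip link set dev eth1 txqueuelen 10000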

  2. Running several sysctl commands I get:

    net.core.rmem_max = 124928
    net.core.wmem_max = 4194304
    net.core.rmem_default = 124928
    net.core.wmem_default = 124928
    

    I tried increasing net.core.rmem_max and net.core.wmem_max to a much bigger number, 16,777,216, but I still keep getting the same errors.
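
    Note that net.core.wmem_max is only a ceiling that an application may request via setsockopt(SO_SNDBUF); sockets that don't ask for more keep using net.core.wmem_default. A sketch of raising both at runtime (the value is illustrative):

        sysctl -w net.core.wmem_default=16777216
        sysctl -w net.core.wmem_max=16777216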

  3. Running sar -n UDP 1 1 gives (approximated values, but no errors):

    05:47:31 PM    idgm/s    odgm/s  noport/s idgmerr/s
    05:48:46 PM  23000.00  24000.00      0.00      0.00
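
    sar can also report per-interface error and drop rates, which might catch something netstat misses (assuming sysstat is installed; interval and count are arbitrary):

        sar -n EDEV 1 3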
    
  4. Running ethtool on the OpenStack VM instance mostly results in "Operation not supported". Running ethtool on the OpenStack host server against the interface it uses to communicate with the outside world, I get:

    [root@myServer] ~> ethtool em1
        Speed: 1000Mb/s
        ... 
    
    [root@myServer] ~> ethtool -g em1
        Ring parameters for em1:
        Pre-set maximums:
        RX:             4096
        RX Mini:        0
        RX Jumbo:       0
        TX:             4096
        Current hardware settings:
        RX:             256
        RX Mini:        0
        RX Jumbo:       0
        TX:             256
    

    I am not convinced this is related, though, as the errors I see are inside the VM, not on the OpenStack host server. Update: I increased the RX and TX values, but with no success.
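
    For reference, the ring increase was done on the host with something like the following (the values are the pre-set maximums from above):

        ethtool -G em1 rx 4096 tx 4096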

eddyP23
  • What are your UDP buffer memory settings, and what is the average size of the packets you are sending? Also, what is the speed of the NIC, and what is your peak throughput? – Aaron Mar 07 '16 at 15:57
  • @Aaron I updated the question with the average packet size. How do I find out the buffer memory settings? I looked at `/etc/sysctl.conf` but it doesn't have anything useful. Also, how do I find the speed of the NIC? – eddyP23 Mar 07 '16 at 16:12
  • Usually `ethtool DEVICENAME` or `mii-tool`, unless this is one of your VMs in OpenStack, in which case you would need to run those commands on the bare-metal compute node. The sysctl settings above are not defaults, so you might ask whoever tuned them to assist in your troubleshooting. – Aaron Mar 07 '16 at 16:34
  • I noticed you updated the question to include the packet size, ty. 60-120 bytes is very small. Do you know what application is creating them and how many packets per second you are sending? If you install `iftop` you can see the throughput. `sar` can give you some idea of the packet rate. `sar` may even give more details about packets being dropped, in percentages. – Aaron Mar 07 '16 at 16:50
  • @Aaron I know which application is sending the requests, but that doesn't help me diagnose why they are missing. – eddyP23 Mar 10 '16 at 17:05
  • @eddyP Does the application do transmit pacing? If not, you should expect to lose lots of datagrams. – David Schwartz Mar 10 '16 at 18:34
  • @DavidSchwartz what do you mean by transmit pacing? And why should I expect to lose lots of datagrams? `RcvbufErrors` isn't reporting any errors. – eddyP23 Mar 10 '16 at 19:02
  • @eddyP Transmit pacing means that you carefully control the timing of datagram sending to avoid datagram loss. If you don't do this, you should expect to lose lots of datagrams because that's how datagrams work. Errors won't necessarily be reported because the loss can occur at various different points in the path and each error counter only tracks some single point. You likely need to set a burst level and a data rate and delay sending UDP datagrams that exceed the allowed data rate and burst level. Otherwise, a "clump" of datagrams will result in loss somewhere. – David Schwartz Mar 10 '16 at 19:06
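
For illustration, a coarse kernel-side stand-in for the application-level pacing described in the last comment: a token-bucket qdisc (tbf) smooths bursts at the interface rather than in the sender, so it trades drops for queuing delay. The rate, burst, and latency values here are made up and would need tuning:

    tc qdisc add dev eth1 root tbf rate 100mbit burst 64kb latency 50ms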

0 Answers