
Maybe somebody will be able to help me out with this. I'm trying to find out if there is anything that can be optimized server-side to reduce delays in case of packet loss.

Environment: Windows Server 2012 client, CentOS 6.x server [Couchbase], same datacenter, busy LAN with firewalls to traverse. Both are large physical servers with plenty of spare capacity.

Issue: as measured from the client, response times are nicely distributed around ~1 ms, but we see a spike at ~200 ms.

A network trace shows this:

  1. Client -> sends request
  2. Server -> replies after ~1 ms with a single packet containing {application response + TCP ACK of the request} (78 bytes in this case)
  3. The packet is NOT received by the client
  4. After ~30 ms, the client TCP stack retransmits the original request
  5. The server replies immediately with a duplicate ACK (66 bytes, does not contain the application response)
  6. After ~200 ms from the initial request, the server retransmits the original response (78-byte packet).
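
For what it's worth, a server-side capture along the following lines is enough to show this exchange (the interface name, client address and port below are illustrative, not my actual values):

    # Capture the whole conversation with one client on the server side;
    # eth0 and 10.0.0.42 are placeholders, 11210 is the default Couchbase data port
    tcpdump -i eth0 -s 0 -w trace.pcap host 10.0.0.42 and port 11210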

Any idea where this 200 ms delay comes from, and how to reduce it? I'd guess some combination of TCP delayed ACKs, Nagle's algorithm and congestion/RTO behaviour, but Linux kernel tuning is a bit of a mystery to me.
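
In case it helps, the RTO the kernel is currently using for a given connection can be read with ss, and the F-RTO behaviour with sysctl (the client address below is a placeholder):

    # Per-connection TCP internals, including the current rto and rtt estimates
    ss -ti dst 10.0.0.42

    # Forward RTO-Recovery setting used on retransmission timeouts
    sysctl net.ipv4.tcp_frto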

Any suggestion?

  • How and where do you measure? Do you have Wireshark running on both ends? Have you checked the server response (2.) in detail? Are the MAC and IP address correct (matching the client's)? The question is: why is the client not receiving the server reply (2.)? The client behaviour looks like pure TCP: it retransmits the packet after a timeout (the retransmission here seems to be triggered by a timeout, not by received duplicate ACKs). "_I'd guess some combination of tcp delayed acks, nagle and congestion/RTO algorithms_" Do you even know what you're writing there :)? – Hansi Jun 24 '16 at 06:53

1 Answer


Yes: Wireshark on both sides, tcpdump, network traces taken at the switch level (rather high-end Arista 10G switches), traces taken on the firewall (Fortinet), and so on.

The problem is not why the client is not receiving the reply. This is a busy network with bursty traffic, so losing one packet in 10,000 is not unexpected. But I need to provide an SLA even when I lose a packet, and this 200 ms of delay is throwing it off.

I mean, experimenting on DEV I can 'fix' the problem by setting the TCP RTO for the client subnet to 5 ms via a route command [server-side]. With this, 99.999% of my requests get answered in under 10 ms, and I would meet my SLA. Fine, but what are the drawbacks of doing this in production? Is the RTO the real issue, or am I fixing it by accident? Is this the best possible fix, or is there something smarter/better (a tuned profile? a sysctl parameter? a prayer to the Minix gods?)?
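
For reference, the route change I used on DEV is roughly the following (the subnet and device are examples, not my real ones):

    # Server-side: lower the minimum TCP RTO towards the client subnet to 5 ms;
    # 10.0.0.0/24 and eth0 stand in for my actual client subnet and NIC
    ip route change 10.0.0.0/24 dev eth0 rto_min 5ms

    # Verify that the route now carries the rto_min attribute
    ip route show 10.0.0.0/24

Whether a 5 ms floor is safe under real congestion is exactly what I'm unsure about, hence the questions above.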

Thanks.
