
The issue was detected while analyzing application logs, which reported spike periods lasting a few seconds, during which messages from multiple clients are received on the server with a substantial delay (up to a couple of seconds). The application uses persistent connections, over which clients and the server exchange short messages (much smaller than the MTU) a couple of dozen times per second (think voice data/gaming traffic).

To dig deeper, I recorded a tcpdump capture and found that random segments from multiple (but not all) clients get lost during those spikes (so the server sends out a lot of SACKs). The retransmissions arrive after about 300 ms in the best case, hence the delay at the application level while the server waits for the missing segments. For a particular affected client, it's not just one retransmission per spike but a series of retransmissions. Commands like ifconfig -a don't report any packet loss, and /var/log/syslog is clean. The link is 10 Gbit, while incoming/outgoing traffic barely reaches 10 Mbit in peak hours.
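
To put numbers on how long the server waits for each missing segment, the capture can be post-processed. Below is a minimal sketch, assuming the received data segments have first been exported as `(time_s, flow_id, seq, length)` tuples in capture order. That tuple layout and the function name `gap_fill_delays` are hypothetical, not something tcpdump produces directly. It ignores sequence-number wraparound and assumes retransmitted segments keep their original boundaries.

```python
from collections import defaultdict


def gap_fill_delays(segments):
    """Measure how long each sequence-number hole stayed open on the receiver.

    segments: iterable of (time_s, flow_id, seq, length) in capture order,
    one tuple per received data segment (hypothetical export format).
    Returns a list of (flow_id, hole_start_seq, delay_s).
    """
    expected = {}              # flow -> next in-order sequence number
    ahead = defaultdict(dict)  # flow -> {seq: end} for data received past a hole
    opened = {}                # (flow, hole_start_seq) -> time the hole was noticed
    delays = []

    for time_s, flow, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data
        end = seq + length
        if flow not in expected:
            expected[flow] = end
            continue
        nxt = expected[flow]
        if seq > nxt:
            # Data arrived beyond the next expected byte, so a hole exists.
            ahead[flow][seq] = max(end, ahead[flow].get(seq, 0))
            opened.setdefault((flow, nxt), time_s)
        elif end > nxt:
            # This segment covers the next expected byte and closes the hole.
            start_time = opened.pop((flow, nxt), None)
            if start_time is not None:
                delays.append((flow, nxt, time_s - start_time))
            nxt = end
            # Skip over data that was already received out of order.
            while nxt in ahead[flow]:
                nxt = ahead[flow].pop(nxt)
            ahead[flow] = {s: e for s, e in ahead[flow].items() if s > nxt}
            expected[flow] = nxt
            if ahead[flow]:
                # Another hole remains further along; time it from now,
                # which may underestimate how long it has been open.
                opened.setdefault((flow, nxt), time_s)
        # Otherwise it duplicates data already received and is ignored.
    return delays
```

Grouping the returned delays by time window would show whether holes on different flows open at the same moment, which is what a shared bottleneck would look like.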

The question is: what may cause this, which tools can help spot a potential problem, and where should I look? Could this have to do with the server provider?

tonso
  • Packet loss can happen at any device between client and server (e.g. router, firewall, load balancer, ...) and also on the server itself. It is often connected with overload of the specific intermediary or end devices, but might also be caused by bugs. To find out where the loss happens, you need to do a packet capture at the specific devices to see where exactly the packets get lost. Self-reported statistics on these devices about packet load and packet loss might help too. – Steffen Ullrich Apr 05 '23 at 10:43
  • This is far too broad, however in my experience, the most common cause of packet loss is insufficient capacity, specifically one pipe connecting to a smaller pipe. – Greg Askew Apr 05 '23 at 10:58
  • @SteffenUllrich The fact that it happens for multiple random clients at the same time suggests that it is not an issue with individual clients' devices/routers imo... – tonso Apr 05 '23 at 11:14
  • @tonso: *"not an issue with individual clients' devices/routers"* - I agree. But many clients usually share at least some part of the network path. For example, clients using the same ISP will share most of the path, and several ISPs might use the same upstream. And even if they all come from different ISPs and upstreams, they will share the last part of the path through the infrastructure where the server is located. – Steffen Ullrich Apr 05 '23 at 11:51
  • @GregAskew Ok, but how to narrow down the search then? – tonso Apr 05 '23 at 11:59
  • @SteffenUllrich I analyzed the IPs of the clients affected during one spike, and not only do they use various unrelated ISPs, but some also come from outside the US. So it seems like it should be a data center infrastructure problem, though it's unclear how you would even approach the server provider regarding this... Probably some built-in DDoS protection? – tonso Apr 05 '23 at 12:12
  • There might be an overload due to traffic spikes - which might be caused by DDoS but might also be non-malicious traffic spikes facing a limited capacity of the provider. – Steffen Ullrich Apr 05 '23 at 12:22
  • `how to narrow down the search then?` There is insufficient information. The only thing we know is the application design uses long-lived connections using a provider that probably doesn't offer an SLA. You could start by defining 'long-lived'. – Greg Askew Apr 05 '23 at 15:37
  • @GregAskew By long-lived I meant connections that persist without idling, not the typical one-time bulk download (this is not my definition of long-lived TCP connections). Mmm, but I don't understand why you are saying that this is the only information provided. I clearly wrote 'over which clients and server are exchanging short messages (much less than MTU) a couple of dozen times per second (think voice data/gaming traffic).' I truly believe this contains *some* information about the application design. What else should be described? – tonso Apr 05 '23 at 20:17
  • Two suggestions: 1) Use UDP for this type of traffic. 2) If you really need to troubleshoot this, implement a few independent probes that continuously send some data back and forth (like echo ping) and monitor their statistics. – Peter Zhabin Apr 05 '23 at 21:33
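
The independent probe suggested in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: a UDP echo responder runs on the server side, the probe runs at some vantage point, and the names `run_echo_server` and `probe`, the 4-byte sequence-number payload, and all intervals and timeouts are hypothetical choices, not part of any existing tool.

```python
import socket
import struct
import threading
import time


def run_echo_server(sock, stop_event):
    """Echo every datagram back to its sender until stop_event is set."""
    sock.settimeout(0.2)
    while not stop_event.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def probe(server_addr, count=20, interval_s=0.05, timeout_s=1.0):
    """Send `count` numbered datagrams; return (sent, lost, rtts_in_seconds)."""
    rtts = []
    lost = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        for seq in range(count):
            sent_at = time.monotonic()
            sock.sendto(struct.pack("!I", seq), server_addr)
            try:
                while True:
                    data, _ = sock.recvfrom(2048)
                    # Ignore late replies that belong to earlier probes.
                    if len(data) >= 4 and struct.unpack("!I", data[:4])[0] == seq:
                        break
                rtts.append(time.monotonic() - sent_at)
            except socket.timeout:
                lost += 1  # no matching reply within timeout_s
            time.sleep(interval_s)
    return count, lost, rtts


if __name__ == "__main__":
    # Local demonstration only: responder and probe share one machine.
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server_sock.bind(("127.0.0.1", 0))
    stop = threading.Event()
    worker = threading.Thread(target=run_echo_server, args=(server_sock, stop))
    worker.start()
    sent, lost, rtts = probe(server_sock.getsockname(), count=10)
    stop.set()
    worker.join()
    server_sock.close()
    print(f"sent={sent} lost={lost} max_rtt_ms={max(rtts, default=0) * 1000:.2f}")
```

Running several such probes from different networks and logging timestamps alongside loss and RTT would make it possible to line their statistics up against the spikes seen in the application logs.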
