Server stops sending SYN-ACK after several normal connections

Question

I have a few thousand devices behind NAT talking to two servers. Each device sits behind a local router (think modem/router) that NATs it onto a private network shared by thousands of these devices. At the gateway for this private network, TCP sessions from all of these devices are dynamically translated with NAT overload (PAT) to ports on a single global IP address. So when device 1 talks to the server, the connection comes from global_ip_of_the_router:port_number_1. Once device 1 is done and its NAT association has been removed, the remote router could assign the same global port to device 2 when it wants to talk to the same server, i.e. the server could see the new TCP connection come from global_ip_of_the_router:port_number_1 again.
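To make the port-reuse scenario concrete, here is a minimal sketch of a PAT table. The class name `PatTable`, the port range, the private addresses, and the "lowest free port first" policy are all assumptions for illustration; the real router may pick ports differently.

```python
class PatTable:
    """Toy NAT overload (PAT) table: (private_ip, private_port) -> global port."""

    def __init__(self, first_port, last_port):
        # Assumption: the lowest free port is handed out first.
        self.free_ports = list(range(first_port, last_port + 1))
        self.mappings = {}

    def open(self, private_ip, private_port):
        global_port = self.free_ports.pop(0)
        self.mappings[(private_ip, private_port)] = global_port
        return global_port

    def close(self, private_ip, private_port):
        # Assumption: a released port becomes reusable immediately (no hold timer).
        global_port = self.mappings.pop((private_ip, private_port))
        self.free_ports.insert(0, global_port)


table = PatTable(40000, 40010)
port_device_1 = table.open("10.0.0.1", 51000)   # device 1 connects
table.close("10.0.0.1", 51000)                  # device 1 is done, mapping removed
port_device_2 = table.open("10.0.0.2", 52000)   # device 2 connects next
print(port_device_1, port_device_2)             # both are 40000
```

From the server's side, both connections arrive from the same (global IP, port) pair, so they share one 4-tuple.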

The devices themselves open a TCP connection, do an HTTP POST of a small file, tear down the TCP connection, open a new connection for the next file, and so on. This works fine for ~20 files, after which, in response to a SYN, the device gets back just an ACK (no SYN flag) from the server. The ACK number on it is completely unrelated to the sequence number on the SYN. The device immediately sends a RST, backs off, and retries the SYN from the same source port after 1 second. It still gets just an ACK, so it keeps backing off at 3, 6, 12, 24 and 48 seconds before giving up. The RSTs from the device appear to use a SEQ that follows from the ACK number on the spurious ACK, in an attempt to shut down what is, from the server's perspective, an old connection.
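The RST behavior described here matches what RFC 793 prescribes for a client in SYN-SENT that receives an ACK whose ACK number is outside the range `ISS < SEG.ACK <= SND.NXT`: it replies with a RST whose SEQ equals the received ACK number. A rough sketch of that check follows; the function names and the dictionary used for the reply are illustrative, and the ISS value in the example is made up.

```python
MOD = 2 ** 32


def seq_leq(a, b):
    """True if a <= b in 32-bit wraparound sequence arithmetic."""
    return ((b - a) % MOD) < 2 ** 31


def syn_sent_on_ack(iss, snd_nxt, seg_ack):
    """Reaction of a client in SYN-SENT to a segment with ACK set and no SYN.

    Returns the RST to send, or None if the ACK number is acceptable
    (in which case a segment without SYN or RST is simply dropped).
    """
    unacceptable = seq_leq(seg_ack, iss) or not seq_leq(seg_ack, snd_nxt)
    if unacceptable:
        # RFC 793: <SEQ=SEG.ACK><CTL=RST>
        return {"flags": "RST", "seq": seg_ack}
    return None


# Hypothetical ISS of 1000; the ACK number is one of those seen in the trace.
print(syn_sent_on_ack(iss=1000, snd_nxt=1001, seg_ack=2899295595))
```

So the device side looks standards-compliant; the open question is why the server holds state for this 4-tuple when the SYN arrives.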

The remote host is an AWS ELB. Here are the hypotheses we have had and what we have tried:

Our first hypothesis was that the remote router treats the TCP session as dead, times out the NAT mapping, and reuses the global port sooner than the destination server (the ELB) forgets the old connection. That could leave the ELB in TIME_WAIT, which would be why it responds to the SYN with an ACK. The ELB's TIME_WAIT duration is not known; assuming it is the standard 60-second default of the Linux kernel, it would match the post-FIN/RST NAT timeout on the remote router. To avoid any race condition, we raised that timeout on the router to 70 seconds. This did not make the issue go away.

We then tried the opposite. We figured that if the remote router removed the NAT mapping sooner, it would assign a new mapping to the SYN retries as the device goes through its backoff. If the problem at the destination server were tied to the global port in use on the remote router, a new SYN arriving from a new global port on the router's IP should get it out of the weird state. We could see the port change happen, but the newly assigned NAT port hit the same issue: the server again returned a spurious ACK, with yet another ACK number.

Another hypothesis was that this only happens when the SEQ on the SYN is lower than the sequence numbers of the last connection that used the same global port on the remote router, i.e. the ACK number on the spurious ACK would always be higher than the SEQ on the SYN. (We switched Wireshark to absolute sequence numbers to check this.) However, we are seeing instances where the SYN SEQ is higher than the ACK number on the spurious ACK, so that theory went by the wayside.

We are now at a loss as to what could be happening here. Our suspicion was that the new connection gets the same global port as an old connection. However, if that were the case, then (a) making the router keep the NAT mapping longer should have prevented it, and (b) having the router drop the NAT mapping earlier and assign a different one to the same connection attempt should have sidestepped it.
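One caveat on comparing the SYN SEQ with the ACK number as absolute values: TCP sequence numbers wrap at 2^32, so "higher" and "lower" only make sense modulo 2^32. A small sketch of a wraparound-aware comparison is below; `seq_distance` is a helper name chosen here, and the example values are made up, not taken from the trace.

```python
def seq_distance(a, b):
    """Signed distance from a to b modulo 2**32, in the range [-2**31, 2**31)."""
    d = (b - a) % 2 ** 32
    return d - 2 ** 32 if d >= 2 ** 31 else d


# Hypothetical values: as plain integers the SYN SEQ is far larger than the
# ACK number, yet in sequence space the ACK number is 796 bytes *ahead*.
syn_seq = 4294967000
ack_number = 500
print(seq_distance(syn_seq, ack_number))   # 796
```

A positive result means the second argument is ahead of the first in sequence space. Rechecking the "SYN SEQ is higher than the ACK number" cases this way would show whether any of them are really wraparound cases.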

Any help in understanding this behavior would be greatly appreciated.

Wireshark trace here: http://www.filedropper.com/traffictrace-anonymizedandpacketswithpayloadremoved

Please note that the trace has been anonymized (IPs and MACs replaced) and all TCP packets with payload have been removed. The first instance of the problem starts at packet 129, second instance packet 382, then 463, 699, 816, 1120, 1278, 1323 etc.

The very last instance in the trace is where we shortened the NAT post-FIN/RST timeout on the router. You can see that on the first four attempts the spurious ACK has ACK number = 2899295595. On attempt 5 it is 3102149417, and on attempt 6 it is 4158039292. This is because the router is set to time out the NAT mapping sooner here, so these attempts come from a different global port on the router. If the issue were related to the global port and the previous connection that used it, this should have stopped it. But the problem persists, which leads us to believe it is not source-port related but results from something in the TCP SYN itself.
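As a quick way to read that last instance, the ACK numbers can be grouped into consecutive runs. The list below is just the six values quoted above; interpreting each run as "one global port, one piece of stale state on the server" is our assumption, not something the trace proves.

```python
from itertools import groupby

# ACK numbers on the spurious ACKs for attempts 1..6 of the last instance.
ack_numbers = [2899295595] * 4 + [3102149417, 4158039292]

# Collapse consecutive identical ACK numbers into (ack_number, attempts) runs.
runs = [(ack, len(list(group))) for ack, group in groupby(ack_numbers)]
for ack, attempts in runs:
    print(ack, attempts)
```

This gives three runs (4, 1 and 1 attempts), i.e. three different ACK numbers across six attempts.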

Yesterday we tried setting the NAT post-FIN/RST timer to 300 seconds, and these broken connections went away. My guess is that we delayed port reuse to a point at which the ELB has discarded the previous connection. We are wondering whether the ELB's idle timeout, which is set to 295 seconds, is also the value it uses for TIME_WAIT, or for treating a connection as valid even after a FIN. If that were the case, though, we should have seen a lot more connections fail, because there is rampant port reuse on the router. It would be good to know exactly what is going on.
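The timer guess can be written down as a simple check. This is only a sketch of the hypothesis: the function name is made up, and the 295-second retention on the ELB side is an assumption taken from the idle timeout, not a documented value.

```python
def port_reuse_can_collide(nat_hold_seconds, server_retention_seconds):
    """True if the router may reuse a global port while the server
    could still hold state for the previous connection on that port."""
    return nat_hold_seconds < server_retention_seconds


ASSUMED_ELB_RETENTION = 295  # assumption: same as the configured idle timeout

results = {hold: port_reuse_can_collide(hold, ASSUMED_ELB_RETENTION)
           for hold in (60, 70, 300)}
print(results)
```

Under that assumption the 60- and 70-second settings can collide and the 300-second setting cannot, which matches what we observed, but it does not explain why only some reused ports fail.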

Please delete [the SO version of your question](http://stackoverflow.com/q/42075977/560648), rather than crossposting. — Lightness Races in Orbit, Feb 09 '17 at 01:26

