
I have intermittent TCP connection issues in a complex application that runs on Windows.

I'm trying to determine whether the problem is in my code or a bug in Windows itself.

The system consists of a client application, a server application, and a web application GUI. The GUI connects to the server via the API port, and the client application connects on a different port.

My test setup has the client program connect through an SSH tunnel that redirects to the server, which runs on the same system as the client. The server also listens on an API port on localhost, serviced by a different thread.

The code runs on Windows 10 version 2004 in VMware Workstation.

At certain points in time the server temporarily stops responding to SYN packets. New connections take 2 to 3 seconds to establish, and existing connections experience lag due to retransmissions. Since all connections are, from my server's (and Windows') perspective, coming from localhost, and are handled on two different threads, I've exhausted the explanations in my own code that could account for the issue.

The issue appears roughly every 20 minutes, which also makes me suspect that something else is wrong, unrelated to my code.
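To timestamp the stalls, a standalone probe along these lines can log the TCP handshake time once per second, which makes the 2-3 second stalls easy to spot. The 127.0.0.1 address and port 8080 here are placeholder assumptions, not my real configuration:

// Standalone probe (not part of the application): logs how long each
// TCP handshake takes, to timestamp the stalls. Link with ws2_32.lib.
#include <winsock2.h>
#include <ws2tcpip.h>
#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    WSADATA WsaData;
    WSAStartup(MAKEWORD(2, 2), &WsaData);

    sockaddr_in Address = {};
    Address.sin_family = AF_INET;
    Address.sin_port = htons(8080); // placeholder port
    inet_pton(AF_INET, "127.0.0.1", &Address.sin_addr);

    for (;;)
    {
        SOCKET Probe = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

        auto Start = std::chrono::steady_clock::now();
        int Result = connect(Probe, (sockaddr*)&Address, sizeof(Address));
        auto Milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - Start).count();

        printf("connect %s after %lld ms\n",
               Result == 0 ? "succeeded" : "failed", (long long)Milliseconds);

        closesocket(Probe);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}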

I had the chance to obtain a packet dump of a connection attempt made with curl, which looks like this:

[Screenshot of the packet capture: the client's SYN and first data packet are retransmitted, and the server responds only after roughly 3 seconds.]

As can be seen in the capture, there's a good 3 second delay before the server responds, on localhost! The server is a super simple polling design, and I can't spot what the problem is. The server code responsible for accepting this connection looks like this:


cR<void> cWindowsTCPServer::HasConnectionsPending()
{
    fd_set ReadSet = {};

    FD_ZERO(&ReadSet);

    FD_SET((SOCKET)_GenericFD, &ReadSet);

    // A zeroed timeval makes this select() a non-blocking poll.
    timeval Timeout = {};
    // Passing 0 as nfds is not a bug on Windows; that argument is ignored.
    int SelectResult = select(0, &ReadSet, NULL, NULL, &Timeout);

    if (SelectResult == SOCKET_ERROR)
        return cR<void>(false);

    return cR<void>(FD_ISSET((SOCKET)_GenericFD, &ReadSet) != 0);
}

cR<std::shared_ptr<iSocketBase>> cWindowsTCPServer::AcceptConnection()
{
    uint32_t TempFD = INVALID_SOCKET;

    sockaddr_in RemoteAddress = {};

    int AddrLength = sizeof(RemoteAddress);

    TempFD = (uint32_t)accept(_GenericFD, (sockaddr*)&RemoteAddress, &AddrLength);

    if (TempFD == INVALID_SOCKET)
        return cR<std::shared_ptr<iSocketBase>>(false);
    // This would not work for IPv6, but IPv4 is hardcoded in all clients that connect here...
    std::string Ip   = inet_ntoa(RemoteAddress.sin_addr);
    uint16_t    Port = ntohs(RemoteAddress.sin_port);

    return cR<std::shared_ptr<iSocketBase>>(true, std::make_shared<cWindowsTCPSocket>(TempFD, Ip, Port));
}

void cAPIServer::handle_server()
{
    while ((bool)_server_socket->HasConnectionsPending())
    {
        auto accepted_client = _server_socket->AcceptConnection();

        std::thread(&cAPIServer::handle_client, this, accepted_client.Value()).detach();
    }
}

void cAPIServer::server_main()
{
    while (_is_running)
    {
        handle_server();

        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}
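For completeness: I'm aware of the more common pattern where select() is given a real timeout and blocks until the listening socket becomes readable, rather than polling and sleeping 5 ms; this socket class just hasn't been refactored in that style. A standalone sketch of that variant (the connection handler below is a stub, not my real one):

// Sketch of the blocking accept loop (standalone, not my actual classes).
// select() blocks for up to 50 ms waiting for the listening socket, so the
// thread wakes as soon as a client connects instead of on a 5 ms tick.
#include <winsock2.h>
#include <atomic>
#include <thread>

void AcceptLoop(SOCKET ListenSocket, std::atomic<bool>& IsRunning)
{
    while (IsRunning)
    {
        fd_set ReadSet;
        FD_ZERO(&ReadSet);
        FD_SET(ListenSocket, &ReadSet);

        timeval Timeout = {};
        Timeout.tv_usec = 50 * 1000;   // 50 ms, so IsRunning is re-checked

        // The nfds argument is ignored on Windows, hence the 0.
        if (select(0, &ReadSet, NULL, NULL, &Timeout) > 0)
        {
            sockaddr_in Remote = {};
            int Length = sizeof(Remote);
            SOCKET Client = accept(ListenSocket, (sockaddr*)&Remote, &Length);

            if (Client != INVALID_SOCKET)
            {
                // Hand the connection off; this handler is only a stub.
                std::thread([Client] { closesocket(Client); }).detach();
            }
        }
        // A return value of 0 is just a timeout; error handling omitted.
    }
}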

The client, server, and SSH together cycle through unused ports at a rate of ~6 per second, but from everything I've read, the Windows port-exhaustion issue doesn't appear until a client uses about 33 connections per second. In perfmon and netstat there are never more than about 22 connections active at a time, and I see only about 60 connections in the TIME_WAIT state before they are reclaimed by the system. There are ~64k ports available for connecting, so I don't think that's the cause.
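For anyone who wants to reproduce the measurement, the per-state counts that netstat shows can also be sampled programmatically; a minimal sketch using GetTcpTable from iphlpapi (purely a measurement aid, not part of the server):

// Count sockets per TCP state via GetTcpTable.
// Link with iphlpapi.lib and ws2_32.lib.
#include <winsock2.h>
#include <iphlpapi.h>
#include <cstdio>
#include <map>
#include <vector>

int main()
{
    ULONG Size = 0;
    // First call only reports the required buffer size.
    GetTcpTable(NULL, &Size, FALSE);

    std::vector<char> Buffer(Size);
    PMIB_TCPTABLE Table = (PMIB_TCPTABLE)Buffer.data();

    if (GetTcpTable(Table, &Size, FALSE) != NO_ERROR)
        return 1;

    std::map<DWORD, int> Counts;
    for (DWORD i = 0; i < Table->dwNumEntries; ++i)
        ++Counts[Table->table[i].dwState];

    printf("ESTABLISHED: %d\n", Counts[MIB_TCP_STATE_ESTAB]);
    printf("TIME_WAIT:   %d\n", Counts[MIB_TCP_STATE_TIME_WAIT]);
    printf("CLOSE_WAIT:  %d\n", Counts[MIB_TCP_STATE_CLOSE_WAIT]);
    return 0;
}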

The interval at which it appears is always around 20 minutes. Port exhaustion would also only affect new connections, but in the screenshot it is clear that the first data-carrying packet from the client is also retransmitted twice, after the connection was established.

Have I made a mistake in my code, or is there something I have missed?

Edit:

I've since run the following experiments:

  • Run the client on the VM and the server on my host Windows 10 machine. The results are the same.

  • Remove all other network adapters (such as OpenVPN), even though these were not active. Results are the same.

  • Reboot the system(s) involved. Results are the same.

  • Disable Windows Defender real-time scanning. Results are the same.

  • Open an ncat listener on another port when I notice the packet loss occurring, and connect to it. It seems more laggy than normal, but I didn't take the time to measure this accurately, so I might be wrong.

  • Run a netsh trace session and open the events (nothing special stood out, but there was an enormous number of events, so I could easily have missed something).

  • Disable MPP and profiles (Memory Pressure Protection, a form of TCP SYN-flood protection in Windows; from memory, via `netsh int tcp set security mpp=disabled profiles=disabled`), which had no effect.

  • Connect the client directly to the server instead of through the SSH tunnel; same results.

Edit 2:

I've noticed a couple more things that deepen the mystery for me. If I terminate the server and client the moment the packet loss occurs and then restart them, the issue is still there. If this is port exhaustion, it must be some novel variant of it in Windows.

There's no persistence between restarts: no shared database, shared configuration, or anything similar. Neither the server nor the client re-uses state from previous runs, so I think this confirms that the issue is not in my code. Even if I instruct the code to use a different port, the packets keep dropping.

  • Probably not what you've run up against, but `uint32_t TempFD` could be too small to hold a socket handle. `SOCKET` could be backed by a 64 bit integer. Best to stick with `SOCKET` all the way through. – user4581301 Oct 15 '20 at 21:39
  • @user4581301 Thanks for your response! After checking, it does appear that my platform backs SOCKET with a uint64_t. But this does not seem related to my issue, since the accept call succeeds, yet the client still sees retransmissions. – user513647 Oct 15 '20 at 22:07
  • @rustyx, thanks for the suggestion. Since I last read this I've tried that and more; I've added my other experiments to the main post. I've never seen anything like this and I'm really breaking my head over what it can be :S – user513647 Oct 15 '20 at 23:33
  • Your accept loop will stop as soon as you encounter a select timeout. Surely this is not what you want? – user207421 Oct 16 '20 at 01:04
  • @MarquisofLorne Do I misunderstand how select works? Can it block? The goal of that loop is to break once no more clients are waiting. If the select loop stops, the only thing the server does is wait 5 milliseconds before polling again. If that's what you meant, that's by design, and I don't understand how it can lead to the described problem. – user513647 Oct 16 '20 at 10:06
  • `select` can block. One very common use of `select` is to let the sucker run in a loop until you tell it to stop, managing all of the sockets: the listening socket, all of the accepted sockets, and on a Unix-like system any other file descriptor you want managed. Unless you have other things to do in the same thread, just let `select` loop and `accept`, waking up every now and then to see if the thread has been asked to terminate. For five milliseconds (typically below Windows' sleep granularity; your tick may actually be 15.625 ms) I'd just stay in the loop. – user4581301 Oct 16 '20 at 16:22
  • You can use a tool like [Sysinternal's Process Explorer and TCPView](https://learn.microsoft.com/en-us/sysinternals/downloads/) to see if you are accidentally starving the system with a glut of unclosed sockets or similar. – user4581301 Oct 16 '20 at 16:29
  • @user4581301 I'm using TCPView and Process Hacker 2 (a clone of Process Explorer). I'm seeing about 30 sockets in CLOSE_WAIT. AFAIK that's not enough to lead to the packet loss, right? Obviously it leads to some sort of problem... but I still don't know what. And I'm aware of that usage pattern; this socket class just hasn't been refactored in that style. I meant my question more like: "can you see any reason why select would block in the code as given?", since that might potentially explain the timeouts. I don't see how it can block in the code above. – user513647 Oct 16 '20 at 18:15
  • When you connected the client directly (rather than through SSH), did you still connect over the loopback (e.g. by running the client on the VM with the server)? – JimD. Oct 16 '20 at 18:33
  • @JimD Yes I did, I'm immediately going to try and see if listening on a real interface has any effect. – user513647 Oct 16 '20 at 18:44
  • I think you are seeing packet loss on the loopback (which I know doesn't seem sensible). The Windows loopback has a network layer (unlike Linux). If you can upgrade to Windows Server 2012 you could use the "fast path" loopback optimization to deliver the packets directly to the TCP stack. https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/hh997026(v=ws.11) – JimD. Oct 16 '20 at 18:49
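For reference, the loopback "fast path" mentioned in the last comment is opted into per socket with the SIO_LOOPBACK_FAST_PATH ioctl, called on both ends before connect()/listen(). A minimal sketch, assuming a Windows 8 / Server 2012 or later SDK (I haven't verified whether it changes the behaviour described here):

// Sketch: opt a socket into the loopback fast path. Must be done on
// both ends of the connection, before connect()/listen(). Requires
// Windows 8 / Server 2012 or later; SIO_LOOPBACK_FAST_PATH is in mstcpip.h.
#include <winsock2.h>
#include <mstcpip.h>

bool EnableLoopbackFastPath(SOCKET Socket)
{
    int Enabled = 1;
    DWORD BytesReturned = 0;

    int Result = WSAIoctl(Socket, SIO_LOOPBACK_FAST_PATH,
                          &Enabled, sizeof(Enabled),
                          NULL, 0, &BytesReturned, NULL, NULL);

    return Result != SOCKET_ERROR;
}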
