
The server (192.168.1.5:3001) is running Linux 3.2 and is designed to accept only one connection at a time. The client (192.168.1.18) is running Windows 7. The connection is wireless. Both programs are written in C++.

It works great about 9 connect/disconnect cycles in 10. On roughly the tenth cycle (it happens randomly), the server accepts the connection, but when it later actually writes to it (typically 30+ seconds later), Wireshark (see screenshot) shows it writing to an old, stale connection, on a port number that the client FINed a while ago but the server never FINed back. So the client and server connections get out of sync: the client makes new connections, and the server tries writing to the previous one. Every subsequent connection attempt fails once it gets into this broken state. The broken state can be triggered by going beyond the maximum wireless range for half a minute (as before, 9 times in 10 this works, but it sometimes causes the broken state).

Wireshark screenshot behind link

The red arrows in the screenshot indicate when the server started sending data (Len != 0), which is the point when the client rejects it and sends a RST to the server. The coloured dots down the right edge indicate a single colour for each of the client port numbers used. Note how one or two dots appear well after the rest of the dots of that colour (and note the time column).

The problem looks like it's on the server's end, since if you kill the server process and restart, it resolves itself (until next time it occurs).

The code is hopefully not too out of the ordinary. I set the backlog parameter in listen() to 0, which I think means it allows only one current connection and no pending connections (I tried 1 instead, but the problem was still there). None of the "// error" branches shown in the code ever produce a trace print when this happens.

// Server code

mySocket = ::socket(AF_INET, SOCK_STREAM, 0);
if (mySocket == -1)
{
  // error
}

// Set non-blocking
const int saveFlags = ::fcntl(mySocket, F_GETFL, 0);
::fcntl(mySocket, F_SETFL, saveFlags | O_NONBLOCK);

// Bind to port

// Union to work around pointer aliasing issues.
union SocketAddress
{
  sockaddr myBase;
  sockaddr_in myIn4;
};

SocketAddress address;
::memset(&address, 0, sizeof(address));
address.myIn4.sin_family = AF_INET;
address.myIn4.sin_port = htons(Port);
address.myIn4.sin_addr.s_addr = INADDR_ANY;
if (::bind(mySocket, &address.myBase, sizeof(address)) != 0)
{
  // error
}
if (::listen(mySocket, 0) != 0)
{
  // error
}


// main loop
{
  ...
  // Wait for a connection.
  fd_set readSet;
  FD_ZERO(&readSet);
  FD_SET(mySocket, &readSet);
  const int aResult = ::select(getdtablesize(), &readSet, NULL, NULL, NULL);
  if (aResult != 1)
  {
    continue;
  }
  // A connection is definitely waiting.
  const int fileDescriptor = ::accept(mySocket, NULL, NULL);
  if (fileDescriptor == -1)
  {
    // error
  }

  // Set non-blocking
  const int saveFlags = ::fcntl(fileDescriptor, F_GETFL, 0);
  ::fcntl(fileDescriptor, F_SETFL, saveFlags | O_NONBLOCK);

  ...
  // Do other things for 30+ seconds.
  ...
  const int bytesWritten = ::write(fileDescriptor, buffer, bufferSize);
  if (bytesWritten < 0)
  {
    // THIS FAILS!! (but succeeds the first ~9 times)
  }

  // Finished with the connection.
  ::shutdown(fileDescriptor, SHUT_RDWR);
  while (::close(fileDescriptor) == -1)
  {
    switch(errno)
    {
    case EINTR:
      // Break from the switch statement. Continue in the loop.
      break;
    case EIO:
    case EBADF:
    default:
      // error
      return;
    }
  }
}

So somewhere between the accept() call (assuming that is exactly the point when the SYN packet is handled) and the write() call, the client's port gets changed to the previously-used client port.

So the question is: how can it be that the server accepts a connection (and thus opens a file descriptor), and then sends data through a previous (now stale and dead) connection/file descriptor? Does it need some sort of option in a system call that's missing?

Jetski S-type
  • Your wireshark image link isn't working. – SoapBox Jul 02 '12 at 09:19
  • Thanks - hopefully it works now. – Jetski S-type Jul 02 '12 at 09:29
  • When you shutdown your connection, try using `SHUT_WR` only first, then wait until `recv` returns `0` before `SHUT_RD` and `close`. – jxh Jul 02 '12 at 09:52
  • Instead of ignoring all errors except EINTR and treating them as success you should at least `perror()` them. Then at least you would have some idea what's happening, rather than trying to debug a packet sniff trace. Clearly you have some scoping problem with your file descriptor variable. @user1495449 it isn't necessary to `shutdown()` in any mode at all unless there are inherited socket FD's, and even then it isn't necessary to shutdown for read before closing. – user207421 Jul 02 '12 at 10:09
  • In your wireshark it looks like there are some dropped packets. I see retransmissions, "TCP window full"... maybe those late packets are also retransmissions of packets that were dropped the first time and that's why they're out of order. Can you do a ping and see if there's a large drop rate? – Alan Curry Jul 02 '12 at 10:55
  • @EJP - sorry, I probably worded that badly. By "None of the errors appear as trace prints where "// error" is shown in the code", I meant that in the actual code, I do have trace prints in the place of "// error", and no errors were printed in execution, so I am confident that those error cases are not the issue. (So yes, I do have to stoop down to the level of debugging a packet sniff trace.) In the actual non-simplified code, there is proper scoping for the file descriptor variable. I have since read more about `shutdown()` since I first wrote those parts, but thanks. – Jetski S-type Jul 02 '12 at 12:35
  • @AlanCurry - there are indeed dropped packets/retransmissions/etc. I got the connection to drop for testing purposes by going beyond the maximum range of the wireless network, as explained above. Once I get it into the "bad state", it never gets out of the bad state, even if the wireless devices are right next to each other with low latency (and besides, out of order packets in the order of a minute apart is inexcusable for normal TCP behaviour). – Jetski S-type Jul 02 '12 at 12:39
  • The client (port 49956) drops the connection (packet 10492 & 10496) after the connection is established. I doubt there is some bug in your client side code. Could you post your client side code here? – ciphor Jul 02 '12 at 13:01
  • If the link layer is still dropping packets after you bring them back within range, then I don't think there's much you can do about it at the application layer. And it's normal for TCP to keep trying for several minutes. – Alan Curry Jul 02 '12 at 13:01
  • Also: what's with the `// Do other things for 30+ seconds.` comment? Is a 30 second pause between accept() and write() considered normal for this interaction? You've got a client that hangs up less than 3 seconds after a connection is established (packets 10473, 10475, 10476, 10492). What the server did next in that case is not a surprise: it wrote to the socket which was at that point half-closed. And got a RST because the other computer considered the connection completely closed. Normal TCP operation. – Alan Curry Jul 02 '12 at 13:22
  • one last piece of the puzzle: [the listen queue size argument is only a hint and small values are ignored.](http://stackoverflow.com/questions/5111040/listen-ignores-the-backlog-argument/5111340) – Alan Curry Jul 02 '12 at 13:48
  • @ciphor - the part I abstracted away was that between the client and server, there are two TCP connections (with port 3001 being the secondary one; the primary one has no out-of-sync issues like this). Where packet 10492 drops the connection, this was actually initiated by the client by sending an application-layer "end session" message. So that is not a suspicious part. I also doubt there is a bug in my client side (or did you mean "not"?). What makes you think the client code could be useful? – Jetski S-type Jul 03 '12 at 00:37
  • @AlanCurry - what makes you think packets are being dropped after packet 10057? The missing numbers in the "No." column are from a different TCP port or another protocol (see above comment by me). Wireshark doesn't skip numbers on the "No." column when sequence numbers jump or packets are dropped. I am convinced this is not normal TCP behaviour to attempt to send packets to an old port that should have been dropped a minute ago. – Jetski S-type Jul 03 '12 at 00:44
  • never mind that comment. I was probably wrong about that part. Read my more recent comments though! – Alan Curry Jul 03 '12 at 00:45
  • @AlanCurry - yes, a 30 second pause between the accept() and write() is normal for interaction on this particular port (there is more frequent communication in another primary port). I mentioned just two comments above this one just now that the connection closing at 10492 was a proper client-initiated connection close on the application layer over the other primary connection, so that is not suspicious that it dropped (I don't think - I did all sorts in testing). What is suspicious, however, is that the server didn't FIN back the client's FIN in packets 10492 & 10496. – Jetski S-type Jul 03 '12 at 00:53
  • How do I make it fully closed instead of half-closed then? And thanks for the piece of puzzle there about the backlog... I did overlook that it was a hint. – Jetski S-type Jul 03 '12 at 00:54
  • No, that isn't suspicious. It's perfectly allowable in TCP for one side to FIN and the other side to keep sending. The server doesn't need to send a FIN to say "I saw your FIN and acknowledge it" - that is accomplished by a normal ACK. A FIN from the server would mean "I'm done sending too". – Alan Curry Jul 03 '12 at 00:57
  • So what you want to do is, during the 30 seconds of processing time, detect that the client has hung up, and abort the slow request processing stuff, and immediately close() the socket and do the next accept(). I can't tell you exactly how to do that without seeing the code that runs during those 30 seconds. But maybe you can throw in a select() or poll() periodically? The socket will become readable when the first FIN arrives, and read() will return 0 (EOF). – Alan Curry Jul 03 '12 at 00:59
  • Ah, so that's the question now. A FIN is sent by the close(), right? How do I ensure the FIN is being sent by the server then? The TCP termination diagram from Wikipedia suggests that both ends need to FIN for it to be a complete termination: https://upload.wikimedia.org/wikipedia/commons/5/55/TCP_CLOSE.svg – Jetski S-type Jul 03 '12 at 01:05
  • When the server closes the socket that it got from the accept(), a FIN will be sent. (Assuming the socket fd wasn't dup'ed or passed to another process or somehow kept alive). You can also send a FIN with shutdown(SHUT_WR) but that's for when you want to keep reading after you're done writing, which doesn't apply in this case. – Alan Curry Jul 03 '12 at 01:08
  • Thanks @AlanCurry, I've got a band-aid fix now. This connection is meant to be one way (write-only for the server), but in the main select() call, I added this write-only connection. If this troublesome connection is set, then I read from it, acknowledge that it is 0 bytes (it should never not be), and then disconnect. So this does help part of the pain of connecting and then finding out 30+s later when it tries to send data that it's actually a broken pipe. What it does now is disconnect immediately for every connection attempt from that point on (it never recovers from the broken state). – Jetski S-type Jul 03 '12 at 02:33
  • So then restarting the server process does fix it, but it would be nice to not break in the first place. – Jetski S-type Jul 03 '12 at 02:33

2 Answers


I'm submitting an answer to summarize what we've figured out in the comments, even though it's not a finished answer yet. It does cover the important points, I think.

You have a server that handles clients one at a time. It accepts a connection, prepares some data for the client, writes the data, and closes the connection. The trouble is that the preparing-the-data step sometimes takes longer than the client is willing to wait. While the server is busy preparing the data, the client gives up.

On the client side, when the socket is closed, a FIN is sent notifying the server that the client has no more data to send. The client's socket now goes into FIN_WAIT1 state.

The server receives the FIN and replies with an ACK. (ACKs are done by the kernel without any help from the userspace process.) The server socket goes into the CLOSE_WAIT state. The socket is now readable, but the server process doesn't notice because it's busy with its data-preparation phase.

The client receives the ACK of the FIN and goes into FIN_WAIT2 state. I don't know what's happening in userspace on the client since you haven't shown the client code, but I don't think it matters.

The server process is still preparing data for a client that has hung up. It's oblivious to everything else. Meanwhile, another client connects. The kernel completes the handshake. This new client will not be getting any attention from the server process for a while, but at the kernel level the second connection is now ESTABLISHED on both ends.

Eventually, the server's data preparation (for the first client) is complete. It attempts to write(). The server's kernel doesn't know that the first client is no longer willing to receive data because TCP doesn't communicate that information! So the write succeeds and the data is sent out (packet 10711 in your wireshark listing).

The client gets this packet and its kernel replies with RST because it knows what the server didn't know: the client socket has already been shut down for both reading and writing, probably closed, and maybe forgotten already.

In the wireshark trace it appears that the server only wanted to send 15 bytes of data to the client, so it probably completed the write() successfully. But the RST arrived quickly, before the server got a chance to do its shutdown() and close() which would have sent a FIN. Once the RST is received, the server won't send any more packets on that socket. The shutdown() and close() are now executed, but don't have any on-the-wire effect.

Now the server is finally ready to accept() the next client. It begins another slow preparation step, and it's falling further behind schedule because the second client has been waiting a while already. The problem will keep getting worse until the rate of client connections slows down to something the server can handle.

The fix will have to be for you to make the server process notice when a client hangs up during the preparation step, and immediately close the socket and move on to the next client. How you will do it depends on what the data preparation code actually looks like. If it's just a big CPU-bound loop, you have to find some place to insert a periodic check of the socket. Or create a child process to do the data preparation and writing, while the parent process just watches the socket - and if the client hangs up before the child exits, kill the child process. Other solutions are possible (like F_SETOWN to have a signal sent to the process when something happens on the socket).

Alan Curry
  • Thanks @AlanCurry for your effort in answering. Before reading this answer, I did actually happen to find the cause and fix it because of your earlier pieces of the puzzle. It was tricky to visualise the problem, because it's complex and I didn't explain it all. The server actually doesn't normally close the connection - only if it can't communicate to the client. The client is always the same process connecting to the server. The client only gives up waiting because it is out of sync (see the A & B connection description in my answer). – Jetski S-type Jul 03 '12 at 07:36
  • The server wants to send many MBs of data actually, but the 15 bytes is the header. It begins sending a few KBs just after the RST too. There's bits and pieces of the puzzle here, so I'll upvote it (oops - don't have enough reputation to yet...). Cheers. – Jetski S-type Jul 03 '12 at 07:38

Aha, success! It turns out the server was receiving the client's SYN, and the server's kernel was automatically completing the handshake (with a SYN-ACK) before accept() had even been called. So there definitely is a listening queue, and having two connections waiting on the queue was half of the cause.

The other half of the cause was information omitted from the question (I thought it was irrelevant because of the false assumption above). There is a primary connection port (call it A), and the secondary, troublesome connection port which this question is all about (call it B). The proper connection order is: A establishes a connection (A1), then B attempts to establish a connection (which would become B1), within a time frame of 200ms (I had already doubled the timeout from the 100ms written ages ago, so I thought I was being generous!). If the server doesn't get a B connection within 200ms, it drops A1.

So B1 establishes a connection with the server's kernel anyway, waiting to be accepted. It only gets accepted on the next connection cycle, when A2 establishes a connection and the client also sends a B2 connection. The server accepts the A2 connection, then takes the first connection on the B queue, which is B1 (never yet accepted; the queue looked like B1, B2). That is why the server didn't send a FIN for B1 when the client had disconnected B1. So the two connections the server holds are A2 and B1, which are obviously out of sync. It tries writing to B1, which is a dead connection, so it drops A2 and B1. The next pair are A3 and B2, also an invalid pair. It never recovers from being out of sync until the server process is killed and the TCP connections are all reset.

So the solution was simply to change the timeout for waiting on the B socket from 200ms to 5s. Such a simple fix that had me scratching my head for days (and it was fixed within 24 hours of putting it on stackoverflow)! I also made it recover from stray B connections by adding socket B to the main select() call, then accept()ing and close()ing it immediately (which would only happen if a B connection took longer than 5s to establish). Thanks @AlanCurry for the suggestion of adding it to the select() and for the puzzle piece about the listen() backlog parameter being a hint.

Jetski S-type