8

I've got a simple client/server setup where it seems like TCP packets I'm sending from the client are not arriving at the server.

Normally everything works fine, but when I spin up 50 threads on the client to hit the server "simultaneously" with the same small data packet (only 39 bytes), a random number of the connections fail to deliver all the bytes to the server. Even stranger is that the failure is very consistent: only 5 bytes are received.

I'm using tcpdump and tcpflow to capture what is going on at both ends (if you're not familiar with tcpflow, it strips out the massive amount of TCP SYN/ACK/FIN/etc. noise from the TCP stream and just shows you the data sent in either direction).

On the client side, for 50 threads firing off the 39-byte packet, it looks perfect. Specifically, tcpflow (which uses libpcap) shows me 50 identical data transfers:

07 B6 00 01 | 00 1E 00 00 | <etc>

As I understand it, libpcap/tcpdump get data from a pretty low level (below the TCP stack) so I take this to mean that the data was sent ok, or at least was not stuck in the kernel buffers.

However, when looking at the server side, all is not perfect. A random number of the connections fail, and it is a high percentage. For example, out of the 50 socket connections, 30 will work fine, but for 20 of them I get a protocol failure where the server's socket.recv times out waiting for bytes (the protocol specifies the exact packet length).

It is VERY consistent in how it fails. For the 30/20 case, 30 of the sockets perfectly receive the transmitted 39 bytes. The remaining 20 ALL receive this partial data, after which my socket.recv times out:

07 B6 00 01 | 00

Only 5 bytes are arriving for each of the 20 connections, and it seems to be at the kernel level since tcpdump is only showing 5 bytes arriving as well.

How can this happen?

This 5-byte boundary is not a coincidence. It is the first part of a header, and the 34-byte payload comes next, but it is not arriving. On the client side the data is split like this:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((HOST, PORT))  # HOST/PORT defined elsewhere
sock.sendall(HEADER)   # 5 bytes
sock.sendall(PAYLOAD)  # 34 bytes

and both sock.sendall calls complete successfully in every thread, as is proven by the tcpflow logging showing that all 50 runs send the 39 bytes "out the door" perfectly.
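
For completeness, the server-side read is roughly the following shape (a simplified sketch; recv_exact, conn and the timeout value are stand-ins for my actual protocol code):

def recv_exact(conn, length, timeout=10.0):
    # Simplified stand-in for my protocol code: read exactly `length` bytes,
    # letting socket.timeout propagate if the data never arrives.
    conn.settimeout(timeout)
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = conn.recv(remaining)
        if not chunk:
            raise RuntimeError("peer closed before full packet arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

header = recv_exact(conn, 5)    # these 5 bytes arrive
payload = recv_exact(conn, 34)  # this is the recv that times out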

Any ideas on the root cause of this? What am I missing?

Russ
  • Can you reproduce the problem with the clients and the server running on the same machine to rule out any networking hardware in between them? – Shane Wealti Apr 20 '12 at 16:42
  • @Shane - It doesn't happen locally at all. In fact, I only noticed this problem after running unit/system tests with the server application running on a remote server (ping of roughly 100 ms). Remote server is running linux kernel 3.0. Client is 2.6.32. – Russ Apr 20 '12 at 16:51
  • Can you try running your server application on a remote server on top of a Windows (or other non linux) OS to rule out something specific to the Linux implementation of the TCP/IP stack? Perhaps spin up an Amazon cloud VM for an hour or two and run it on there. – Shane Wealti Apr 20 '12 at 17:13
  • I'd suggest running a capturing (`-w`) tcpdump on the sending side. See if you're getting ACKs for the data that "goes missing" on the receiving side. If so, the receiver is screwing up somehow; if not, check for retransmits on the sender side and if those are lacking, the sender is screwing up somehow. – torek Apr 20 '12 at 17:26
  • @Shane - I may get there, but I'd be beyond stunned if the TCP implementation is borked in Linux. That seems reeeeally unlikely. – Russ Apr 20 '12 at 17:37
  • @torek - Thanks for the tip. Sifting through all of that is seriously stretching my TCP know-how... there is a lot of stuff going on in TCP through all 50x transactions. That said, I'm fiddling with wireshark now to try and do precisely that... even though I am certain (hopeful?) that it is something much simpler at a higher level!! – Russ Apr 20 '12 at 17:47
  • Do you get similar failures with a smaller number of threads? – Russell Borogove Apr 20 '12 at 21:14
  • @RussellBorogove - For small numbers (< ~ 20) it did not happen, but it was somewhat random. – Russ Apr 22 '12 at 16:25
  • @torek - tcpdump captures with wireshark viewing worked well to see what was going one... at least to some level! See my answer. I'm still clueless as to why some of the transmissions were getting held up, though. If out of my hands entirely, then I can't do anything. I'm mostly curious if there is something I can do in my code to help prevent such problems. – Russ Apr 22 '12 at 16:27

3 Answers

6

Answering my own question...

The short answer is that, with TCP alone, the client has no way of knowing whether the intended recipient has actually received the bytes sent.

i.e.: It doesn't matter that the client "happily" sent the bytes... even with TCP they may never arrive, and you have no way of knowing when they will reach the intended recipient. Not without building some acknowledgement into the application layer, anyway.

For my particular case, it turns out that the bytes the client sent DID actually arrive at the server, but took ~ 30s (!!!) to arrive, by which time both client and server application protocol code had timed out.

Views of the client and server side logs (for one failed connection) are here:

Those images are Wireshark views of one particular TCP stream from the tcpdump capture files. You can see that there were a whole lot of retransmissions occurring. What was the root cause driving the need for these retransmissions? I have absolutely no idea (but would love to know!).

The data arrived at the server in the second-to-last entry (#974), ~30s after it was sent, and there were a large number of retransmission attempts in between. If you're curious about server-side #793, that is an attempt by my application-layer protocol to send a message back to the client saying "timed out waiting for more data... where is it?".

In addition to the inherent delays, part of the reason the data was not appearing in the tcpdump logs at the server seems to be my usage of tcpdump. In short: make sure to Ctrl-C out of the tcpdump capture before looking at the capture file (the one created with the -w switch), as it seems to make a big difference to what you see in the file. I expect this is a flush/sync issue, but I'm guessing. Without the Ctrl-C, I was definitely missing data in the file.

More detail for future reference...

Although you often read/hear that TCP will:

  1. Guarantee that your packets will arrive (vs UDP, which doesn't)
  2. Guarantee that your packets will arrive in order

it is apparent/obvious that the first is not actually true at all. TCP will do its best to get your bytes to the intended recipient (including retrying for a LONG time), but this is not a guarantee, regardless of the send man page saying of the return value that "On success, these calls return the number of characters sent". That statement is highly misleading (see below).

The root of this comes mostly from the way that the various socket calls (send in particular) behave and how they interact with the TCP/IP stack of the operating system...

On the sending side of a TCP exchange, the progression is quite simple. First you connect() and then you send().

connect() returning successfully definitely means that you were able to establish a connection to the server, so you at least know that at that time the server was there and listening (i.e.: the three-way TCP opening handshake was successful).

For `send`, although the documentation for the call indicates that the return value (if positive) is the "number of [bytes] sent", this is just plain wrong. All that the return value tells you is the number of bytes that the TCP stack in your underlying OS accepted into its outgoing buffer. After this point, the OS will try its best to deliver those bytes to the recipient you initially made a connection with. But this may never happen, so it does not mean you can count on those bytes being delivered! Somewhat surprisingly, there is also no real way to even determine whether this did (or did not!) happen, at least at the TCP socket layer, even though TCP has built-in ACK messages. To verify full receipt of your sent bytes, you need to add some sort of acknowledgement at the application layer. nos has a great answer in another question that talks a bit about this.
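
As a rough sketch of what such an application-layer acknowledgement could look like on the client side (hypothetical, not my actual protocol; the 2-byte "OK" reply and the timeout are made up for illustration):

# Hypothetical application-level acknowledgement (not my real protocol):
# after sending, wait for a short reply from the server before treating
# the request as delivered.
sock.sendall(HEADER)
sock.sendall(PAYLOAD)
sock.settimeout(10.0)
ack = sock.recv(2)
if ack != b"OK":
    raise RuntimeError("server did not acknowledge the request")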

Addendum...

One interesting dilemma I'm left with here is whether or not I need to build some retry capability into my application-layer protocol. Currently it seems like, in the event of a timeout waiting for data at the server, it would be beneficial to close the connection and open a new one with the same request (see the sketch below). It seems this way because the low-level TCP retries were not succeeding, while in the meantime other client-side threads were getting through in good time. This feels horribly wrong, though... you would think that the TCP retries should be sufficient. But they weren't. I need to look into the root cause of the TCP issues to resolve this.
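
If I do end up adding that retry, it would look something like this on the client (a rough sketch; the retry count, timeout and REPLY_LEN are placeholders):

# Rough sketch of an application-level retry over a fresh connection
# (retry count, timeout and REPLY_LEN are placeholders):
for attempt in range(3):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect((HOST, PORT))
        sock.sendall(HEADER)
        sock.sendall(PAYLOAD)
        sock.settimeout(10.0)
        reply = sock.recv(REPLY_LEN)  # application-level response/ack
        break                         # success, stop retrying
    except socket.timeout:
        continue                      # give up on this connection, try a fresh one
    finally:
        sock.close()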

Russ
  • "it is apparent/obvious that the first is not actually true at all." - well what did you expect? I can simply unplug the cable and the packets will never arrive... you should read stevens' networking bible – Karoly Horvath Apr 22 '12 at 16:28
  • @Karoly - That is what I meant by the "obvious" part. :) I did plan on including the unplugged cable example, but my answer was long enough already. Also, for my case it was not relevant (connection was intact with other connections working fine) and the bytes *still* were not arriving. Which was (and is) odd. – Russ Apr 22 '12 at 16:37
  • @KarolyHorvath - The other part is that it is misleading to say of the `send` return value that "On success, these calls return the number of characters sent". That is a direct quote from the [send man page](http://linux.die.net/man/2/send). This is clearly not the case (although I did actually know that) and I wanted to talk about it in my answer. It makes it sound like it blocks until those bytes are confirmed sent, which is dead wrong. – Russ Apr 22 '12 at 16:42
  • the bytes are sent to the kernel network layer... that's all it does. And an ACK only means that the kernel network layer received the bytes, not the app; that's the reason you cannot query this info... it doesn't have real value. and read the bible ;) – Karoly Horvath Apr 22 '12 at 16:49
  • @Karoly - I added a clarification to the answer on the misleading docs on `send` return value. It is unfortunate that the send documentation doesn't say "returns # of bytes received into kernel network layer's outgoing buffer". I will have to check out that Stevens book, thanks. :) Also, the TCP protocol does have ACK messages, it just seems they don't get used for anything other than the kernel's retry attempts. – Russ Apr 22 '12 at 17:04
  • This is a little surprising. What was the physical layer link between the client and the server? TCP overhead is IIRC something like 48-64 bytes per packet, so in total you were sending something like 20Kbytes on the wire including all the low-level SYN/ACK. – Russell Borogove Apr 22 '12 at 18:05
  • Retransmits generally occur in one of two cases: time out waiting for an ACK, or a system using selective ACK (SACK). I know some systems have had various SACK-related bugs in the past; haven't looked at your traces to see if there are any signs of those. – torek Apr 23 '12 at 05:51
  • @RussellBorogove - Physical network description is simply that it starts in my home office, through router, through the intertubes from eastern North America to the west, arriving at the server machine in Vegas where my server is running as a linux 3.0 guest in a KVM host. Hmm... I will now tinker with the network card emulation there... hadn't thought of that. – Russ Apr 23 '12 at 15:26
  • @torek - Nope no SACK packets that I can see (unless they are non-obvious) – Russ Apr 23 '12 at 15:28
4

You are sending very small numbers of bytes, so you're possibly falling foul of the Nagle algorithm, which will hold back the data you wish to send until a sensible amount of data has been buffered and is ready to be transmitted.

Try adding the following line once you've created your socket, and before you send any data:

sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)

But be aware that by doing so you'll introduce far more communication overhead.
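
In terms of the code from the question, the option would be set something like this (a sketch, using the HOST/PORT/HEADER/PAYLOAD names from the question):

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)  # disable Nagle
sock.connect((HOST, PORT))
sock.sendall(HEADER)   # pushed to the wire immediately, no Nagle hold-back
sock.sendall(PAYLOAD)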

Steve Mayne
  • *Very* interesting, thanks!! I tried it and the behaviour did change substantially, but I still got occasional server-side timeouts (10s... no tcp dump snooping yet, but I'm betting on similar TCP-retry issues). I certainly got more successful attempts (sometimes all 50 arrived intact, which NEVER happened before!), but I also occasionally see fragmented packets... eg: leading bytes missing. I will have to read up on the Nagle algorithm!! – Russ Apr 22 '12 at 17:24
  • That's surprising; I would expect de-Nagling to make things worse, not better. The maximum hold-back time for Nagle is on the order of 200ms IIRC. – Russell Borogove Apr 23 '12 at 17:59
  • That might be true Russell - although removing Nagle will always transmit the data as soon as it's pushed, so it shouldn't make it any worse. – Steve Mayne Apr 23 '12 at 19:02
1

You have to be very careful because (due to buffering) send and recv may not send or receive as much data as you expect "should" be available. You also have to be very careful because any thread can block at any time, even if it "should" be able to receive as much data as you think was sent.
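
For example (a minimal sketch), a bare send may accept only part of your buffer, so it needs to be called in a loop (or use sendall), and similarly recv may return fewer bytes than you asked for:

# Minimal sketch: send() may accept only part of the buffer, so loop
# until the kernel has taken all of it (or just use sendall()).
def send_all(sock, data):
    total = 0
    while total < len(data):
        sent = sock.send(data[total:])
        if sent == 0:
            raise RuntimeError("socket connection broken")
        total += sent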

ddyer
  • Thanks, but I don't understand your answer. Are you talking about the fact that `send` doesn't send all bytes you give it (returning how many it did send) and that `recv` doesn't necessarily read as many bytes as you ask for? If so, both ends of the code deal with this. What is happening is the opposite of this problem. `recv` is blocking on the server, waiting for bytes to arrive, and tcpdump is showing that they never do arrive, despite being (apparently) sent. Which thread blocking are you talking about? – Russ Apr 20 '12 at 17:12
  • @nos - Thanks. I did have the `send`/`recv` loops in place. The issue was all bytes received by the sending TCP stack did not seem to be arriving. Turns out they were, though... see my answer. I also referenced an answer you made to another question... thx for that, too! – Russ Apr 22 '12 at 16:24