writing framed data without extra write() cost

Question

So I'm sending data on a TCP socket, prefixed with the size of the data, like so:

write(socket, &length, sizeof(length));
write(socket, data, length);

(Note: I have wrapper writen functions as described in the Unix Network Programming book, and am checking for errors, etc. The above is just for the simplicity of this question).

Now, my experience is that breaking up data into multiple writes can cause significant slowdown. I have had success speeding things up by creating my own buffer, then sending out one big chunk.

However, in the above case data may be incredibly large (let's say 1 GB). I don't want to create a buffer of 1 GB + 4 bytes just to be able to make one write() call. Is there any way of doing something akin to:

 write(socket, &length, data, sizeof(length) + length)

without paying the price of a large memory allocation ahead of time? I suppose I could just pre-allocate a chunk the size of write's buffer, and continuously send that (the below code has errors, namely, should be sending &chunk + 4 in some instances, but this is just the idea):

length += 4;

char chunk[buffer_size];
var total = 0;

while (total < length)
{
     if (total < 4)
     {
         memcpy(&chunk, &length, 4);
         total += 4;
     }

     memcpy(&chunk, data + total, min(buffer_size, length - total));
     write(sock, &chunk, min(buffer_size, length - total));

     total += min(buffer_size, length - total);
}

But in that case I don't know what write's buffer size actually is (is there an API to get it?). I also don't know if this is an appropriate solution.

I think your `if (total < 4)` clause can be moved out of the while loop. And having been moved out, it doesn't need the if wrapper anymore. — Randall Cook, Apr 07 '15 at 17:37

score 4 · Answer 1 · answered Apr 07 '15 at 17:53

There is an option to do this already. It will inform your network layer that you are going to send more data and you want to buffer rather than send it as soon as possible.

setsockopt(sock_descriptor, IPPROTO_TCP, TCP_CORK, (char *)&val, sizeof(val));

val is an int and should be 0 or 1. With the "cork" on, your network layer will buffer as much as possible so that it only sends full packets. You might want to "pop the cork" and then "cork" again to handle the next batch of transmissions that you need to make on the socket.

Your idea is correct, this just saves you the trouble of implementing it, since it's already done in the network stack.
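As a rough sketch of how the cork could wrap the two writes from the question: this assumes Linux (TCP_CORK is a Linux-specific option) and a connected TCP socket. The names set_cork, write_all and send_framed_corked are made up for illustration, and the length is sent in host byte order exactly as in the question.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: turn TCP_CORK on (1) or off (0). */
static int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical helper: loop until all len bytes are written. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Cork, send the length prefix and the payload, then pop the cork
 * so that any remaining partial packet is flushed. */
static int send_framed_corked(int fd, const void *data, uint32_t length)
{
    int rc = 0;

    if (set_cork(fd, 1) < 0)
        return -1;
    if (write_all(fd, &length, sizeof(length)) < 0 ||
        write_all(fd, data, length) < 0)
        rc = -1;
    if (set_cork(fd, 0) < 0)    /* always try to uncork */
        rc = -1;
    return rc;
}
```

The payload is never copied in user space here; the kernel does the coalescing, at the cost of two extra setsockopt() calls per message.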

LtWorf

This is looking like my best option. I'm also investigating the writev suggested – Francisco Ryan Tolmasky I Apr 07 '15 at 19:31
You should mark the question as answered and pick the answer you used. – LtWorf Apr 28 '15 at 18:33
I ended up sticking with my original strategy (turned out fastest in my tests of all three), but both this and the one below it (writev) are useful in different circumstances, so not sure what to do here. – Francisco Ryan Tolmasky I Apr 28 '15 at 20:43

score 4 · Answer 2 · answered Apr 07 '15 at 18:09

I suggest having a look at writev() (see man writev for full details).

This allows you to send multiple buffers in one go, with just one call. As a simple example, to send out two chunks in one go (one for length, one for data):

struct iovec bits[2];

/* First chunk is the length */
bits[0].iov_base = &length;
bits[0].iov_len = sizeof(length);

/* Second chunk is the payload */
bits[1].iov_base = data;
bits[1].iov_len = length;

/* Send two chunks at once */
writev(socket, bits, 2);

It can get more complicated if you need to use a variable number of chunks (you may need to allocate the array of struct iovec dynamically), but the advantage is that, if your chunks are large, you can avoid copying them, and just manipulate pointer/length pairs, which are much smaller.
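Because writev() can return a short count just like write(), a real sender needs a loop around it. Here is a minimal sketch of one way to do that. The names writev_all and send_framed are invented for this example, it assumes a blocking descriptor (on a non-blocking one EAGAIN would surface as an error), and it sends the length in host byte order as in the question.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: call writev() until everything in iov[] is sent.
 * The iov array is modified in place. Returns 0 on success, -1 on error. */
static int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            return -1;
        }
        size_t sent = (size_t)n;
        /* Drop the vectors that were sent completely. */
        while (iovcnt > 0 && sent >= iov->iov_len) {
            sent -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* Advance into the vector that was only partially sent. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + sent;
            iov->iov_len -= sent;
        }
    }
    return 0;
}

/* Send a 4-byte length prefix followed by the payload, without copying. */
static int send_framed(int fd, const void *data, uint32_t length)
{
    struct iovec iov[2];

    iov[0].iov_base = &length;
    iov[0].iov_len = sizeof(length);
    iov[1].iov_base = (void *)data;     /* iov_base is not const-qualified */
    iov[1].iov_len = length;
    return writev_all(fd, iov, 2);
}
```

A call would look like send_framed(sock, data, length), returning 0 once both the prefix and the payload have been handed to the kernel.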

psmears

Looking into this now, do you know if writev also needs to be looped in case it returns sent < total like write() does? Can't seem to find a lot of examples – Francisco Ryan Tolmasky I Apr 07 '15 at 19:31
@FranciscoRyanTolmaskyI: Yes, it works exactly the same as write() except with multiple buffers instead of one. – psmears Apr 07 '15 at 20:06
@FranciscoRyanTolmaskyI: If your writev() call is non-blocking, you'll experience partial sends. You'll have to manage collapsing out fully sent vectors and adjust iov_base and iov_len on the vector that was partially sent! – Joel Cunningham Apr 08 '15 at 01:29
@JoelCunningham: Yes, though you can often make the code simpler by doing the same for both (i.e. reducing iov_len partially on partially-sent blocks, and reducing it to 0 on fully-sent blocks). – psmears Apr 08 '15 at 07:58
@psmears even if its blocking it'll still do partial sends right? (possibly) – Francisco Ryan Tolmasky I Apr 09 '15 at 01:29
@FranciscoRyanTolmaskyI: It will behave just like `write`. Yes, it's best practice to cope with partial sends. – psmears Apr 09 '15 at 07:45

score 2 · Answer 3 · edited May 23 '17 at 12:32

I think you are on the right track with the spooled solution you presented. I think buffer_size should be larger than that used internally by the network stack. This way, you minimize the amount of per-write overhead without having to allocate a giant buffer. In other words, by giving the underlying network subsystem more data than it can handle at once, it is free to run at its fastest speed, spending most of its time moving data, rather than waiting for more data to be provided.
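For what it's worth, here is one way the spooled idea could look once the offset bugs are ironed out. It is a variation rather than the exact loop from the question: only the length prefix and the first slice of data go through the staging buffer, and the remainder is written straight from data, since nothing needs to be glued onto it. CHUNK_SIZE, write_all and send_framed_chunked are names invented for this sketch, and the 16 KiB value is only a placeholder to tune.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

enum { CHUNK_SIZE = 16 * 1024 };   /* assumed starting value, tune by experiment */

/* Hypothetical helper: loop until all len bytes are written. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Coalesce the length prefix with the first slice of the payload so the
 * prefix never travels in a tiny write of its own. data must be a valid
 * pointer even when length is 0. */
static int send_framed_chunked(int fd, const void *data, uint32_t length)
{
    char chunk[CHUNK_SIZE];
    size_t room = sizeof(chunk) - sizeof(length);
    size_t first = length < room ? length : room;

    memcpy(chunk, &length, sizeof(length));
    memcpy(chunk + sizeof(length), data, first);
    if (write_all(fd, chunk, sizeof(length) + first) < 0)
        return -1;

    /* Whatever did not fit is sent directly from the caller's buffer. */
    return write_all(fd, (const char *)data + first, length - first);
}
```

The copy is bounded by CHUNK_SIZE no matter how large data is, so a 1 GB payload costs at most one small memcpy.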

The optimal buffer_size value might vary from system to system. I would start with 1MB and do some experiments up and down from there to see what works best. There might also be values you can extract and adjust with a sysctl call for the current internal buffer size used on your system. Read this for a suggested technique. You might also use something like getsockopt(..., SO_MAX_MSG_SIZE, ...), as explained here, though note that this is a Winsock option that only applies to message-oriented sockets; for a TCP socket, getsockopt(..., SO_SNDBUF, ...) reports the size of the socket's send buffer.
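To get a starting point for those experiments, the kernel's send buffer size for a given socket can be read with getsockopt() and SO_SNDBUF. A small sketch, where get_send_buffer_size is just an illustrative name:

```c
#include <sys/socket.h>

/* Hypothetical helper: return the socket's send buffer size in bytes,
 * or -1 if getsockopt() fails. */
static int get_send_buffer_size(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len) < 0)
        return -1;
    return size;
}
```

Treat the number as a hint rather than a hard limit: on Linux the reported value includes kernel bookkeeping overhead, and the kernel may resize the buffer automatically while the connection is running.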

IP packets can range up to about 64K in size (a standard Ethernet frame carries at most 1,500 bytes of payload unless jumbo frames are enabled), so perhaps anything larger than 64K is sufficient. Read about maximum transmission unit (MTU) sizes to get a sense of what the lowest layers of the network stack are doing, and don't forget that the MTU varies with the network interface, not the process or kernel.

Beware that the MTU can vary along the route from your server to the data's destination. You can use ifconfig or traceroute/tracepath to discover it. With networking, every link in the chain is weak. ;)
