5

I want to create a C++ server/client that maximizes the throughput of TCP socket communication on localhost. In preparation, I used iperf to find out the maximum bandwidth on my i7 MacBook Pro.

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  256 KByte (default)
------------------------------------------------------------
[  4] local 127.0.0.1 port 5001 connected with 127.0.0.1 port 51583
[  4]  0.0-120.0 sec   329 GBytes  23.6 Gbits/sec

Without any tweaking, iperf showed me that I can reach at least 23.2 GBit/s. Then I wrote my own C++ server/client implementation; you can find the full code here: https://gist.github.com/1116635

In that code I basically transfer a 1024-byte int array with each read/write operation. So my send loop on the server looks like this:

   int n;

   int x[256];

   //fill int array
   for (int i=0;i<256;i++)
   {
       x[i]=i;
   }

   for (int i=0;i<(4*1024*1024);i++)
   {
       n = write(sock,x,sizeof(x));
       if (n < 0) error("ERROR writing to socket");
   }

My receive loop on the client looks like this:

    int x[256];

    for (int i=0;i<(4*1024*1024);i++)
    {
        n = read(sockfd,x,((sizeof(int)*256)));
        if (n < 0) error("ERROR reading from socket");
    }

As mentioned in the title, running this (compiled with -O3) results in the following execution time, which corresponds to about 3 GBit/s:

./client 127.0.0.1 1234
Elapsed time for Reading 4GigaBytes of data over socket on localhost: 9578ms

Where do I lose the bandwidth? What am I doing wrong? Again, the full code can be seen here: https://gist.github.com/1116635

Any help is appreciated!

Christian
  • Can you confirm whether the stat "23GBit/sec" is only for the actual data or whether it includes the TCP, IP and Ethernet headers? – Arunmu Jul 31 '11 at 09:58
  • connecting to 127.0.0.1 isn't a bandwidth test – marinara Jul 31 '11 at 10:00
  • it's not physical bandwidth but the max throughput available on the system; that's still a bandwidth – Karoly Horvath Jul 31 '11 at 10:03
  • You ought to take a moment and wonder how you get 23.2 gbits/sec on a machine that has only a 1 gbit/sec Ethernet interface. Ought to be enough to realize that it really doesn't matter. Shared memory is going to be a lot faster. – Hans Passant Jul 31 '11 at 10:21
  • lol, a 10 second test is completely invalid. Try again for a longer period of time, start at least at 100 seconds. – Steve-o Jul 31 '11 at 10:31
  • obviously localhost (loopback) has nothing to do with the Ethernet interface. – Karoly Horvath Jul 31 '11 at 10:53
  • @Steve-o: I repeated it with a 120sec period. Results are the same, I can also see the 3 gigabyte/sec rate in my network monitor. – Christian Jul 31 '11 at 10:55
  • Christian, do you think that you could repost the code? I need it for a small project, in order to measure the throughput to a remote DSP board. – Nick Dec 12 '12 at 18:55

4 Answers

5
  • Use larger buffers (i.e. make fewer library/system calls)
  • Use asynchronous APIs
  • Read the documentation (the return value of read/write is not simply an error condition, it also represents the number of bytes read/written); a sketch illustrating the first and third points follows this list
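
For illustration only, here is a minimal sketch of what the first and third points could look like on the receiving side. It assumes a connected socket `sockfd` as in the question's client; the function name `drain_socket` and the 1 MiB buffer size are illustrative choices, not tuned values:

    #include <cstddef>
    #include <cstdio>
    #include <unistd.h>
    #include <vector>

    // Read `total` bytes from `sockfd` with one large buffer, honouring the
    // fact that read() reports how many bytes actually arrived.
    bool drain_socket(int sockfd, size_t total)
    {
        std::vector<char> buf(1024 * 1024);   // 1 MiB buffer (illustrative size)
        size_t received = 0;
        while (received < total)
        {
            ssize_t n = read(sockfd, buf.data(), buf.size());
            if (n < 0) { perror("read"); return false; }  // real error
            if (n == 0) break;                            // peer closed the connection
            received += n;                                // count what was actually read
        }
        return received == total;
    }

The same idea applies on the sending side: write() may also transfer fewer bytes than requested, so its return value has to be accumulated rather than only checked for errors.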
CAFxX
  • would an asynchronous API really help in this case? This is just a simple one-to-one communication! – Arunmu Jul 31 '11 at 09:48
  • @ArunMu, asynchronous APIs *may* produce a *marginally* faster result, because you don't incur the penalty of waiting for your `if` statement and your `write` call to execute, during which time no data will be sent. The difference won't be as large as you think. – foxy Jul 31 '11 at 09:50
  • I highly doubt it's going to be any faster. Could someone check this and report elapsed time and CPU usage stats? – Karoly Horvath Jul 31 '11 at 09:55
  • @ArunMu the most important thing is undoubtedly lowering the number of system calls (that's why it is the first point in the list). I listed asynchronous APIs because 1) they help performance even in such simple cases (if properly used) 2) it's good to learn them – CAFxX Jul 31 '11 at 09:58
  • @yi_H no microbenchmarks available, but see e.g. http://blog.lighttpd.net/articles/2006/11/14/linux-aio-and-large-files – CAFxX Jul 31 '11 at 10:02
  • @yi_H I have performed a few simple tests and indeed the number of system calls seems to be the issue. – cnicutar Jul 31 '11 at 10:09
  • @CAFxX: I think this is not relevant here (no disk I/O); the summary says *"No matter what, large files or small files, when you disk start to suffer from seeking around AIO will give you, at least in my setup, 80% more throughput."* – Karoly Horvath Jul 31 '11 at 10:17
3

You can run `strace -f iperf -s localhost` to find out what iperf is doing differently. It seems that it uses significantly larger buffers (131072 bytes with iperf 2.0.5) than you do.

Also, iperf uses multiple threads. If you have 4 CPU cores, using two threads on client and server will result in approximately doubled performance.
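
For illustration, a rough sketch of the parallel-connection idea on the client side, assuming the server is modified to accept more than one connection on port 1234 (the port comes from the question's invocation; the function name `send_stream` is hypothetical and the 128 KiB buffer only mimics iperf's default rather than being a tuned value):

    #include <arpa/inet.h>
    #include <cstddef>
    #include <cstdio>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <thread>
    #include <unistd.h>
    #include <vector>

    // One sender: open its own TCP connection to 127.0.0.1:1234 and push `bytes` bytes.
    static void send_stream(size_t bytes)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0) { perror("socket"); return; }

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(1234);                      // port from the question
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        if (connect(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0)
        {
            perror("connect");
            close(sock);
            return;
        }

        std::vector<char> buf(128 * 1024);                // iperf-like 128 KiB buffer
        size_t sent = 0;
        while (sent < bytes)
        {
            ssize_t n = write(sock, buf.data(), buf.size());
            if (n < 0) { perror("write"); break; }
            sent += n;                                    // write() may send less than asked
        }
        close(sock);
    }

    int main()
    {
        const size_t per_thread = 2ULL * 1024 * 1024 * 1024;  // 2 GiB each, 4 GiB in total
        std::thread t1(send_stream, per_thread);
        std::thread t2(send_stream, per_thread);
        t1.join();
        t2.join();
        return 0;
    }

Compile with -O3 -pthread; each thread gets its own socket so the two streams do not serialize on a single connection.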

phihag
3

My previous answer was mistaken. I have tested your programs and here are the results.

  • If I run the original client, I get 0m7.763s
  • If I use a buffer 4 times as large, I get 0m5.209s
  • With a buffer 8 times as large as the original, I get 0m3.780s

I only changed the client. I suspect more performance can be squeezed if you also change the server.

The fact that I got radically different results than you did (0m7.763s vs 9578ms) also suggests this is caused by the number of system calls performed (as we have different processors). To squeeze out even more performance:

  • Use scatter-gather I/O (readv and writev); a sketch follows this list
  • Use zero-copy mechanisms: splice(2), sendfile(2)
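
For the scatter-gather point, a minimal sketch of what a `writev` call could look like, assuming a connected socket and two separate buffers (the function name `send_two_chunks` and the two-chunk layout are purely illustrative):

    #include <cstddef>
    #include <cstdio>
    #include <sys/types.h>
    #include <sys/uio.h>   // writev, struct iovec

    // Send two separate buffers with a single system call instead of two write()s.
    // Like write(), writev() may transfer fewer bytes than requested, so a real
    // sender would loop until everything is out.
    ssize_t send_two_chunks(int sock, int *a, size_t alen, int *b, size_t blen)
    {
        struct iovec iov[2];
        iov[0].iov_base = a;
        iov[0].iov_len  = alen * sizeof(int);
        iov[1].iov_base = b;
        iov[1].iov_len  = blen * sizeof(int);

        ssize_t n = writev(sock, iov, 2);   // one syscall covers both chunks
        if (n < 0) perror("writev");
        return n;
    }

The benefit is the same as using a bigger buffer: more payload per system call, fewer kernel crossings.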
cnicutar
  • have you checked the result of read calls? The default max buffer size for TCP is 128k on most systems. – Karoly Horvath Jul 31 '11 at 10:37
  • @yi_H Yes, I did. I got straight `8192` every time. I also changed the server to use the same buffer and got it down to `0m2.222s`. – cnicutar Jul 31 '11 at 10:38
  • Thank you for your feedback. When you increased the buffer size on the client side, did you double-check that you are actually transferring the same overall amount of data? I did the same experiment and with an 8x buffer on the client it came down to ~6.5 sec. But no matter whether it's 3.7 sec or 6.5 sec, that's still far away from 23.2 GBits/s... – Christian Jul 31 '11 at 10:40
  • @Christian I am using `int x[2 * 1024];` and reading `512 * 1024` times. – cnicutar Jul 31 '11 at 10:41
  • @cnicutar: With that setting my network monitor tells me that I'm transferring 2GB during the experiment (instead of 4GB otherwise)... – Christian Jul 31 '11 at 10:47
  • @Christian Modify the server to use the same buffer. – cnicutar Jul 31 '11 at 10:49
  • Ok, thank you, that did the trick. I now get about 2 gigabytes per second, which is kind of close. One last question though: how do I determine/change the maximum buffer size from an operating system perspective? – Christian Jul 31 '11 at 11:09
  • @Christian I'm not sure. It depends how much the OS is willing to copy to/from userspace without blocking. Again, look into `readv` and `writev`. – cnicutar Jul 31 '11 at 11:19
  • @Christian: From what I know, the buffer size maintained by the OS is varied by the OS itself, i.e. the OS may resize the TCP buffer as required in order to make sure there is no packet loss. But for UDP the buffer is constant, and when data is not read from the UDP socket fast enough it can lead to packet accumulation and thus packet loss. For Linux, the command to modify the UDP buffer size would be "sysctl -w net.core.rmem_max=8388608" – Arunmu Jul 31 '11 at 14:59
1

If you really want to get maximum performance, use mmap + splice/sendfile, and for localhost communication use Unix domain stream sockets (AF_LOCAL).
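
For illustration, a small sketch of the server side of an AF_LOCAL (a.k.a. AF_UNIX) stream socket; the path /tmp/throughput.sock is a placeholder, and the client simply connect()s to the same path instead of an IP address and port:

    #include <cstdio>
    #include <cstring>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main()
    {
        const char *path = "/tmp/throughput.sock";   // placeholder socket path

        int srv = socket(AF_LOCAL, SOCK_STREAM, 0);
        if (srv < 0) { perror("socket"); return 1; }

        sockaddr_un addr{};
        addr.sun_family = AF_LOCAL;
        std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        unlink(path);                                // remove a stale socket file, if any

        if (bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
            listen(srv, 1) < 0)
        {
            perror("bind/listen");
            return 1;
        }

        int conn = accept(srv, nullptr, nullptr);
        if (conn < 0) { perror("accept"); return 1; }

        // ... read()/write() on `conn` exactly as with the TCP socket ...

        close(conn);
        close(srv);
        return 0;
    }

The read/write loops stay the same as in the TCP version; only the addressing changes, which avoids the TCP/IP stack overhead on the loopback path.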

Karoly Horvath