I have been unsuccessfully trying to apply the TCP Tuning techniques discussed at http://fasterdata.es.net/host-tuning/
As you'll see when you read through the entire question, sometimes the es.net tuning guidelines are effective for me, other times they are not, and I cannot yet figure out what the differentiating factor(s) are.
I have a benchmark laboratory environment set up, with the following:
- a Linux machine, running Ubuntu 14.04
- a Mac OS X machine, running 10.6.8
These two machines are connected over a high-speed, high-bandwidth internal network which I am using for test purposes.
My primary tools for analysis at this point have been iperf, pchar, tcpdump, and Wireshark.
To start with, I run 'iperf -s' on my Mac and 'iperf -c' on my Linux machine, and I reliably and reproducibly measure a bandwidth of approximately 940 Mbps, which makes sense to me because I believe my machines are connected via a 1 Gbps network.
I confirm these measurements with 'pchar'.
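For reference, the baseline runs were essentially the following (a sketch; <mac-ip> is a placeholder for the Mac's address, and 20 seconds is an arbitrary test duration):
iperf -s                        # on the Mac (server side)
iperf -c <mac-ip> -t 20 -i 1    # on the Linux machine (client side)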
Then I artificially introduce high latency into this connection, by doing:
tc qdisc add dev eth0 root netem delay 78ms
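To confirm the delay is actually in effect (and to remove it again between runs), the standard tc and ping invocations suffice; a minimal sketch, with <mac-ip> again a placeholder:
tc qdisc show dev eth0       # should list the netem qdisc with delay 78ms
ping -c 3 <mac-ip>           # round-trip times should now include the added ~78 ms
tc qdisc del dev eth0 root   # removes the netem qdisc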
As soon as I do this, both iperf and pchar report that bandwidth plummets to roughly 75-100 Mbps.
Interestingly, the numbers also start bouncing around a lot, but they are uniformly awful.
I believe this matches the problem described here: http://fasterdata.es.net/host-tuning/background/ (at ~1 Gbps and ~80 ms RTT the bandwidth-delay product is on the order of 10 MB, far larger than the default socket buffers, so throughput is limited to roughly window/RTT), and so I expect to be able to (at least somewhat) address this issue by tuning the TCP stacks on the two machines.
On the Mac, I use 'sysctl -w' to set
net.inet.tcp.win_scale_factor=8
kern.ipc.maxsockbuf=4194304
net.inet.tcp.recvspace=2097152
Attempting to set kern.ipc.maxsockbuf to a higher value is rejected by the operating system with "Result too large". This may be a limitation of this version of Mac OS X, as described here: https://discussions.apple.com/thread/2581395 (I have not yet tried the complicated workaround to this limitation described here: https://www.myricom.com/software/myri10ge/391-how-can-i-restore-the-socket-buffer-sizes-in-macosx-10-6.html)
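For what it's worth, the invocations on the Mac were roughly the following (run as root), plus a read-back to confirm the values took:
sysctl -w net.inet.tcp.win_scale_factor=8
sysctl -w kern.ipc.maxsockbuf=4194304
sysctl -w net.inet.tcp.recvspace=2097152
sysctl net.inet.tcp.win_scale_factor kern.ipc.maxsockbuf net.inet.tcp.recvspace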
Meanwhile, on the Linux machine, I use 'sysctl -w' to set
net.core.wmem_max=16777216
net.ipv4.tcp_wmem = 4096 8388608 16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem = 4096 8388608 16777216
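Concretely, that amounts to something like the following; note that the multi-value tcp_wmem/tcp_rmem settings have to be quoted when passed to 'sysctl -w':
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_max=16777216
sysctl -w net.ipv4.tcp_wmem="4096 8388608 16777216"
sysctl -w net.ipv4.tcp_rmem="4096 8388608 16777216"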
However, none of this tuning seems to change the numbers reported by iperf.
With all of this tuning in effect, iperf still reports a measured bandwidth of about 90 Mbps, essentially unchanged from before the tuning.
I have captured packet traces of this configuration using tcpdump, and looked at them with Wireshark, and as far as I can tell the ACK messages flowing from the Mac back to the Linux machine indicate a window size of nearly 4MB.
Yet the packet traces appear to tell me that the Linux machine is unwilling to send more than about 32K of unacknowledged data at a time, and the "bytes in flight" never rises beyond that.
The packet traces do not show any evidence of lost packets, such as retransmission messages; however, they do show some evidence of packet reordering at times.
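For reference, the capture itself was taken with something along these lines (the interface and filter are illustrative; 5001 is iperf's default port, and sample.pcap is the file analyzed in the update below):
tcpdump -i eth0 -w sample.pcap port 5001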
What am I doing wrong? Why are the TCP tuning techniques described at http://fasterdata.es.net/host-tuning/ not working for me?
UPDATE:
I used 'tcptrace -lW' on a tcpdump packet trace of one such run, and the results are displayed below.
TCP connection 3:
host e: *****:55706
host f: *****:5001
complete conn: yes
first packet: Tue Sep 8 07:48:35.569180 2015
last packet: Tue Sep 8 07:48:55.823746 2015
elapsed time: 0:00:20.254566
total packets: 112524
filename: sample.pcap
e->f: f->e:
total packets: 91306 total packets: 21218
ack pkts sent: 91305 ack pkts sent: 21218
pure acks sent: 2 pure acks sent: 21216
sack pkts sent: 0 sack pkts sent: 820
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 0 max sack blks/ack: 3
unique bytes sent: 131989528 unique bytes sent: 0
actual data pkts: 91303 actual data pkts: 0
actual data bytes: 132057584 actual data bytes: 0
rexmt data pkts: 47 rexmt data pkts: 0
rexmt data bytes: 68056 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 435 pushed data pkts: 0
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/1
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 9 adv wind scale: 6
req sack: Y req sack: Y
sacks sent: 0 sacks sent: 820
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 1448 bytes max segm size: 0 bytes
min segm size: 24 bytes min segm size: 0 bytes
avg segm size: 1446 bytes avg segm size: 0 bytes
max win adv: 29696 bytes max win adv: 3728256 bytes
min win adv: 29696 bytes min win adv: 3116992 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 29696 bytes avg win adv: 3725120 bytes
max owin: 695209 bytes max owin: 1 bytes
min non-zero owin: 1 bytes min non-zero owin: 1 bytes
avg owin: 129597 bytes avg owin: 1 bytes
wavg owin: 186185 bytes wavg owin: 0 bytes
initial window: 13056 bytes initial window: 0 bytes
initial window: 10 pkts initial window: 0 pkts
ttl stream length: 131989528 bytes ttl stream length: 0 bytes
missed data: 0 bytes missed data: 0 bytes
truncated data: 0 bytes truncated data: 0 bytes
truncated packets: 0 pkts truncated packets: 0 pkts
data xmit time: 20.095 secs data xmit time: 0.000 secs
idletime max: 81.0 ms idletime max: 79.8 ms
I don't understand why the avg owin remains so small.
I feel like this is related to the congestion window, but I don't understand why the congestion window is limiting my throughput.
I used the Linux 'ss' tool to observe the cwnd during my run, and it indeed shows me that the cwnd is not increasing:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 696600 N.N.N.N:55708 N.N.N.N:5001
skmem:(r0,rb8388608,t482,tb2097152,f3584,w2097664,o0,bl0) cubic wscale:8,9 rto:280 rtt:80/3 mss:1448 cwnd:486 ssthresh:328 send 70.4Mbps unacked:482 retrans:0/89 rcv_space:29200
...
ESTAB 0 626984 N.N.N.N:55708 N.N.N.N:5001
skmem:(r0,rb8388608,t243,tb2097152,f216832,w1884416,o0,bl0) cubic wscale:8,9 rto:280 rtt:80/3 mss:1448 cwnd:243 ssthresh:240 send 35.2Mbps unacked:243 retrans:0/231 rcv_space:29200
...
ESTAB 0 697936 N.N.N.N:55708 N.N.N.N:5001
skmem:(r0,rb8388608,t289,tb2097152,f3584,w2097664,o0,bl0) cubic wscale:8,9 rto:276 rtt:79.5/3 mss:1448 cwnd:290 ssthresh:240 send 42.3Mbps unacked:290 retrans:0/231 rcv_space:29200
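The samples above were collected by repeatedly running something like the following during the transfer (the 'dst' filter just narrows the output to this connection, with N.N.N.N standing in for the Mac's address):
watch -n 1 'ss -tin dst N.N.N.N'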
Are there any clues in this 'ss' output as to why my throughput is so low?
UPDATE 2
I got access to some additional hardware resources.
Specifically, I was able to locate an alternate Linux box and use it to test against my Mac. Once again, on the Linux box I simulated a high-latency connection by doing:
tc qdisc add dev eth0 root netem delay 78ms
On this box, I then again followed the instructions from http://fasterdata.es.net/host-tuning/ to set the variables
net.core.wmem_max=16777216
net.ipv4.tcp_wmem = 4096 8388608 16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem = 4096 8388608 16777216
And then 'iperf' reported a measured throughput of 911 Mbps between the two machines.
So, I have now successfully applied the TCP Tuning techniques to a Linux box.
BUT, I remain puzzled about why the exact same techniques did not work on the other Linux box.
There are at least two important differences in the second setup:
- In my failed experiment, the Linux machine was a Virtual Machine, while in my successful experiment, the Linux machine was running on bare metal.
- In my failed experiment, the Linux VM and my Mac were on different subnets, separated by a router, while in my successful experiment, the Linux machine and my Mac were on the same subnet.
So, perhaps the techniques listed at http://fasterdata.es.net/host-tuning/ do not work if the Linux machine being tuned is a virtual machine, perhaps because the networking stack in the VM behaves differently enough from the stack on a bare-metal Linux machine that it does not respond to the same tuning.
Or, perhaps the techniques listed at http://fasterdata.es.net/host-tuning/ do not work if there is a router between the endpoints (although, since the whole point of http://fasterdata.es.net/host-tuning/ is Internet-scale WAN tuning, there must nearly ALWAYS be a router between the endpoints, I would think?).
Or, perhaps there is some other factor that differs between my two experiments and is the cause of the horrendous throughput in the first case (but what might that be?).