I have been unsuccessfully trying to apply the TCP Tuning techniques discussed at http://fasterdata.es.net/host-tuning/
As you'll see when you read through the entire question, sometimes the es.net tuning guidelines are effective for me, other times they are not, and I cannot yet figure out what the differentiating factor(s) are.
I have a benchmark laboratory environment set up, with the following:
- a Linux machine, running Ubuntu 14.04
- a Mac OS X machine, running 10.6.8
These two machines are connected over a high-speed, high-bandwidth internal network which I am using for test purposes.
My primary tools for analysis at this point have been iperf, pchar, tcpdump, and Wireshark.
To start with, I run 'iperf -s' on my Mac and 'iperf -c' on my Linux machine, and I reliably and reproducibly measure a bandwidth of approximately 940 Mbps, which makes sense to me because I believe my machines are connected via a 1 Gbps network.
I confirm these measurements with 'pchar'.
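For reference, the baseline runs were essentially the following (a sketch; <mac-ip> is a placeholder for the Mac's address, and 20 seconds is an arbitrary test duration):
iperf -s                        # on the Mac (server side)
iperf -c <mac-ip> -t 20 -i 1    # on the Linux machine (client side)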
Then I artificially introduce high latency into this connection, by doing:
tc qdisc add dev eth0 root netem delay 78ms
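To confirm the delay is actually in effect (and to remove it again between runs), the standard tc and ping invocations suffice; a minimal sketch, with <mac-ip> again a placeholder:
tc qdisc show dev eth0       # should list the netem qdisc with delay 78ms
ping -c 3 <mac-ip>           # round-trip times should now include the added ~78 ms
tc qdisc del dev eth0 root   # removes the netem qdisc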
As soon as I do this, both iperf and pchar report that bandwidth plummets to roughly 75-100 Mbps.
Interestingly, the numbers also start bouncing around a lot, but they are uniformly awful.
I believe this matches the problem described here: http://fasterdata.es.net/host-tuning/background/ (at ~1 Gbps and ~80 ms RTT the bandwidth-delay product is on the order of 10 MB, far larger than the default socket buffers, so throughput is limited to roughly window/RTT), and so I expect to be able to (at least somewhat) address this issue by tuning the TCP stacks on the two machines.
On the Mac, I use 'sysctl -w' to set
net.inet.tcp.win_scale_factor=8
kern.ipc.maxsockbuf=4194304
net.inet.tcp.recvspace=2097152
Attempting to set kern.ipc.maxsockbuf to a higher value is rejected by the operating system with "Result too large". This may be a limitation of this version of Mac OS X, as described here: https://discussions.apple.com/thread/2581395 (I have not yet tried the complicated workaround to this limitation described here: https://www.myricom.com/software/myri10ge/391-how-can-i-restore-the-socket-buffer-sizes-in-macosx-10-6.html)
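For what it's worth, the invocations on the Mac were roughly the following (run as root), plus a read-back to confirm the values took:
sysctl -w net.inet.tcp.win_scale_factor=8
sysctl -w kern.ipc.maxsockbuf=4194304
sysctl -w net.inet.tcp.recvspace=2097152
sysctl net.inet.tcp.win_scale_factor kern.ipc.maxsockbuf net.inet.tcp.recvspace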
Meanwhile, on the Linux machine, I use 'sysctl -w' to set
net.core.wmem_max=16777216
net.ipv4.tcp_wmem = 4096 8388608 16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem = 4096 8388608 16777216
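Concretely, that amounts to something like the following; note that the multi-value tcp_wmem/tcp_rmem settings have to be quoted when passed to 'sysctl -w':
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_max=16777216
sysctl -w net.ipv4.tcp_wmem="4096 8388608 16777216"
sysctl -w net.ipv4.tcp_rmem="4096 8388608 16777216"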
However, none of this tuning seems to change the numbers reported by iperf.
With all of this tuning in effect, iperf still reports a measured bandwidth of about 90 Mbps, essentially unchanged from before the tuning.
I have captured packet traces of this configuration using tcpdump, and looked at them with Wireshark, and as far as I can tell the ACK messages flowing from the Mac back to the Linux machine indicate a window size of nearly 4MB.
Yet the packet traces appear to tell me that the Linux machine is unwilling to send more than about 32K of unacknowledged data at a time, and the "bytes in flight" never rises beyond that.
The packet traces do not show any evidence of lost packets, such as retransmission messages; however, they do show some evidence of packet reordering at times.
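For reference, the capture itself was taken with something along these lines (the interface and filter are illustrative; 5001 is iperf's default port, and sample.pcap is the file analyzed in the update below):
tcpdump -i eth0 -w sample.pcap port 5001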
What am I doing wrong? Why are the TCP tuning techniques described at http://fasterdata.es.net/host-tuning/ not working for me?
UPDATE:
I used 'tcptrace -lW' on a tcpdump packet trace of one such run, and the results are displayed below.
TCP connection 3:
host e: *****:55706
host f: *****:5001
complete conn: yes
first packet: Tue Sep 8 07:48:35.569180 2015
last packet: Tue Sep 8 07:48:55.823746 2015
elapsed time: 0:00:20.254566
total packets: 112524
filename: sample.pcap
e->f: f->e:
total packets: 91306 total packets: 21218
ack pkts sent: 91305 ack pkts sent: 21218
pure acks sent: 2 pure acks sent: 21216
sack pkts sent: 0 sack pkts sent: 820
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 0 max sack blks/ack: 3
unique bytes sent: 131989528 unique bytes sent: 0
actual data pkts: 91303 actual data pkts: 0
actual data bytes: 132057584 actual data bytes: 0
rexmt data pkts: 47 rexmt data pkts: 0
rexmt data bytes: 68056 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 435 pushed data pkts: 0
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/1
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 9 adv wind scale: 6
req sack: Y req sack: Y
sacks sent: 0 sacks sent: 820
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 1448 bytes max segm size: 0 bytes
min segm size: 24 bytes min segm size: 0 bytes
avg segm size: 1446 bytes avg segm size: 0 bytes
max win adv: 29696 bytes max win adv: 3728256 bytes
min win adv: 29696 bytes min win adv: 3116992 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 29696 bytes avg win adv: 3725120 bytes
max owin: 695209 bytes max owin: 1 bytes
min non-zero owin: 1 bytes min non-zero owin: 1 bytes
avg owin: 129597 bytes avg owin: 1 bytes
wavg owin: 186185 bytes wavg owin: 0 bytes
initial window: 13056 bytes initial window: 0 bytes
initial window: 10 pkts initial window: 0 pkts
ttl stream length: 131989528 bytes ttl stream length: 0 bytes
missed data: 0 bytes missed data: 0 bytes
truncated data: 0 bytes truncated data: 0 bytes
truncated packets: 0 pkts truncated packets: 0 pkts
data xmit time: 20.095 secs data xmit time: 0.000 secs
idletime max: 81.0 ms idletime max: 79.8 ms
I don't understand why the avg owin remains so small.
I feel like this is related to the congestion window, but I don't understand why the congestion window is limiting my throughput.
I used the Linux 'ss' tool to observe the cwnd during my run, and it indeed shows me that the cwnd is not increasing:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 696600 N.N.N.N:55708 N.N.N.N:5001
skmem:(r0,rb8388608,t482,tb2097152,f3584,w2097664,o0,bl0) cubic wscale:8,9 rto:280 rtt:80/3 mss:1448 cwnd:486 ssthresh:328 send 70.4Mbps unacked:482 retrans:0/89 rcv_space:29200
...
ESTAB 0 626984 N.N.N.N:55708 N.N.N.N:5001
skmem:(r0,rb8388608,t243,tb2097152,f216832,w1884416,o0,bl0) cubic wscale:8,9 rto:280 rtt:80/3 mss:1448 cwnd:243 ssthresh:240 send 35.2Mbps unacked:243 retrans:0/231 rcv_space:29200
...
ESTAB 0 697936 N.N.N.N:55708 N.N.N.N:5001
skmem:(r0,rb8388608,t289,tb2097152,f3584,w2097664,o0,bl0) cubic wscale:8,9 rto:276 rtt:79.5/3 mss:1448 cwnd:290 ssthresh:240 send 42.3Mbps unacked:290 retrans:0/231 rcv_space:29200
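The samples above were collected by repeatedly running something like the following during the transfer (the 'dst' filter just narrows the output to this connection, with N.N.N.N standing in for the Mac's address):
watch -n 1 'ss -tin dst N.N.N.N'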
Are there any clues in this 'ss' output as to why my throughput is so low?
UPDATE 2
I got access to some additional hardware resources.
Specifically, I was able to locate an alternate Linux box and use it to test against my Mac. Once again, on the Linux box I simulated a high-latency connection by doing:
tc qdisc add dev eth0 root netem delay 78ms
On this box, I then again followed the instructions from http://fasterdata.es.net/host-tuning/ to set the variables
net.core.wmem_max=16777216
net.ipv4.tcp_wmem = 4096 8388608 16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem = 4096 8388608 16777216
And then 'iperf' reported a measured throughput of 911 Mbps between the two machines.
So, I have now successfully applied the TCP Tuning techniques to a Linux box.
BUT, I remain puzzled about why the exact same techniques did not work on the other Linux box.
There are at least two important differences in the second setup:
- In my failed experiment, the Linux machine was a Virtual Machine, while in my successful experiment, the Linux machine was running on bare metal.
- In my failed experiment, the Linux VM and my Mac were on different subnets, separated by a router, while in my successful experiment, the Linux machine and my Mac were on the same subnet.
So, perhaps the techniques listed at http://fasterdata.es.net/host-tuning/ do not work if the Linux machine being tuned is a virtual machine, perhaps because the networking stack in the VM behaves differently enough from the stack on a bare-metal Linux machine that it does not respond to the same tuning.
Or, perhaps the techniques listed at http://fasterdata.es.net/host-tuning/ do not work if there is a router between the endpoints (although, since the whole point of http://fasterdata.es.net/host-tuning/ is Internet-scale WAN tuning, there must nearly ALWAYS be a router between the endpoints, I would think?).
Or, perhaps there is some other factor that differs between my two experiments and is the cause of the horrendous throughput in the first case (but what might that be?).