
I'm running netio (http://freshmeat.net/projects/netio/) on one machine (OpenSolaris) and contacting two different Linux machines (both on kernel 2.6.18-128.el5), machine A and machine B. Machine A gets 10MB/sec of throughput with netio and machine B gets 100MB/sec. On the OpenSolaris box I traced the connections with DTrace, and all the interactions look the same: same window sizes on the receive and send, same ssthresh, same congestion window sizes. The one difference is that the slow machine is sending an ACK for every 2 or 3 receives, whereas the fast machine is sending an ACK every 12 receives. All three machines are on the same switch. Here is the DTrace output:

Fast machine:

delta send   recd  
 (us) bytes  bytes  swnd snd_ws   rwnd rcv_ws   cwnd    ssthresh 
  122 1448 \      195200      7 131768      2 128872  1073725440 
   37 1448 \      195200      7 131768      2 128872  1073725440 
   20 1448 \      195200      7 131768      2 128872  1073725440 
   18 1448 \      195200      7 131768      2 128872  1073725440 
   18 1448 \      195200      7 131768      2 128872  1073725440 
   18 1448 \      195200      7 131768      2 128872  1073725440 
   18 1448 \      195200      7 131768      2 128872  1073725440 
   19 1448 \      195200      7 131768      2 128872  1073725440 
   18 1448 \      195200      7 131768      2 128872  1073725440 
   18 1448 \      195200      7 131768      2 128872  1073725440  
   57 1448 \      195200      7 131768      2 128872  1073725440
  171 1448 \      195200      7 131768      2 128872  1073725440    
   29  912 \      195200      7 131768      2 128872  1073725440   
   30      /    0 195200      7 131768      2 128872  1073725440  

Slow machine:

delta send   recd  
 (us) bytes  bytes  swnd snd_ws   rwnd rcv_ws   cwnd    ssthresh 
  161      /    0 195200     7 131768      2 127424   1073725440  
   52 1448 \      195200     7 131768      2 128872   1073725440 
   33 1448 \      195200     7 131768      2 128872   1073725440   
   11 1448 \      195200     7 131768      2 128872   1073725440   
  143      /    0 195200     7 131768      2 128872   1073725440   
   46 1448 \      195200     7 131768      2 130320   1073725440   
   31 1448 \      195200     7 131768      2 130320   1073725440   
   11 1448 \      195200     7 131768      2 130320   1073725440   
  157      /    0 195200     7 131768      2 130320   1073725440  
   46 1448 \      195200     7 131768      2 131768   1073725440 
   18 1448 \      195200     7 131768      2 131768   1073725440

DTrace code:

#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline int TICKS=$1;
inline string ADDR=$$2;
dtrace:::BEGIN
{
       TIMER = ( TICKS != NULL ) ?  TICKS : 1 ;
       ticks = TIMER;
       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send
/     ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%6d %8d \\ %8s  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
    flag=0;
    title--;
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
      delta=timestamp-walltime;
      walltime=timestamp;

      printf("%6d %8s / %8d  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
    flag=0;
    title--;
}
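To run the script above, pass an optional tick count and peer address; thanks to the defaultargs pragma, both can be omitted. A usage sketch, assuming the script is saved under the hypothetical name tcp_ack.d (the IP address is an example value):

```shell
# Requires root on (Open)Solaris. $1 maps to TICKS, $$2 to ADDR in the script.
chmod +x tcp_ack.d
./tcp_ack.d                  # trace TCP send/receive for all peers
./tcp_ack.d 1 192.168.1.10   # trace only traffic to/from one peer
```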

Followup: I added code to include the number of unacknowledged bytes, and it turns out the slow connection does run up its unacknowledged bytes until it hits the congestion window, whereas the fast machine never hits its congestion window. Here is the output from the slow machine when its unacknowledged bytes hit the congestion window:

unack    unack    delta  bytes   bytes       send   receive  cong       ssthresh
bytes    bytes    (us)   sent    received    window window    window
sent     received
139760      0     31     1448 \             195200  131768   144800   1073725440
139760      0     33     1448 \             195200  131768   144800   1073725440
144104      0     29     1448 \             195200  131768   146248   1073725440
145552      0     31          / 0           195200  131768   144800   1073725440
145552      0     41     1448 \             195200  131768   147696   1073725440
147000      0     30          / 0           195200  131768   144800   1073725440
147000      0     22     1448 \             195200  131768    76744        72400
147000      0     28          / 0           195200  131768    76744        72400
147000      0     18     1448 \             195200  131768    76744        72400
147000      0     26          / 0           195200  131768    76744        72400
147000      0     17     1448 \             195200  131768    76744        72400
147000      0     27          / 0           195200  131768    76744        72400
147000      0     18     1448 \             195200  131768    76744        72400
147000      0     56          / 0           195200  131768    76744        72400
147000      0     22     1448 \             195200  131768    76744        72400

DTrace code:


#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline int TICKS=$1;
inline string ADDR=$$2;
tcp:::send
/     ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%6d %6d %6d %8d \\ %8s  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
      delta=timestamp-walltime;
      walltime=timestamp;
      printf("%6d %6d %6d %8s / %-8d  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
}

The question remains, though, why one machine falls behind and the other doesn't ...

Kyle Hailey
  • I added code to include the unacknowledged bytes, and it turns out the slow connection does run up the number of unacknowledged bytes such that it hits its congestion window: – Kyle Hailey May 17 '11 at 01:08
  • What kind of NIC/what drivers are on the slow box? Have you tried manually setting full-duplex/10/100 instead of (I'm assuming) auto-negotiate on that port? Any TX/RX errors on your switch (if it's managed)? – gravyface May 17 '11 at 01:50
  • +1 for anyone that provides Dtrace code and output! – Marcin May 17 '11 at 02:24
  • You know, the NICs are different makes in the two Linux boxes. I'll have to verify tomorrow their speed and duplex settings. This issue is happening with several different boxes. We just got a new AIX box which showed the same slow throughput, and this kicked off my investigation. The slow Linux box might have a cheap NIC, but the new AIX box should be new, fast gear. Will check into it - thanks – Kyle Hailey May 17 '11 at 03:23
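On the Linux side, the negotiated speed/duplex and the error counters asked about in the comments can both be checked from the command line. A sketch, assuming eth0 is the interface in question:

```shell
# Negotiated link speed, duplex, and auto-negotiation status
ethtool eth0

# Per-interface RX/TX packet and error counters
ip -s link show dev eth0
# (older systems: cat /proc/net/dev)
```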

1 Answer


I have seen behavior like this before. I've seen two causes for it:

  • Bad TCP/IP flow control negotiation
  • Bad drivers

TCP/IP flow-control problems are less likely in your case since both machines are running the same kernel and (except for the device kernel modules if different) therefore running the same TCP/IP code.

Drivers though.

I had a Windows 2003 server a while back that simply couldn't transfer more than 6-10MB/s to certain servers, and as that was a backup-to-disk server this simply wasn't acceptable. After looking at some packet captures, they looked a LOT like what you're seeing. What fixed it was to update the network drivers (Broadcom, as it happened) on the receiving server (the Server 2003 backup server) to something newer. Once that was done, I was getting 60-80MB/s.

Since this is Linux, you just might be running into a Large Segment Offload (LSO) problem of some kind. LSO relies in part on the NIC hardware itself handling the splitting of large segments; if that is not working for some reason (bad firmware?), it can cause these kinds of odd delays. It is configured on a per-driver or per-interface basis; ethtool -K can configure it by device.
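The offload settings can be inspected and toggled per interface with ethtool. A sketch, again assuming eth0 (exact flag names vary by driver and ethtool version):

```shell
# Show current offload settings (tso = TCP segmentation offload / LSO,
# gso = generic segmentation offload, gro/lro = receive-side coalescing)
ethtool -k eth0

# Temporarily disable TSO to test whether offload is the culprit
ethtool -K eth0 tso off

# Re-enable it afterwards if it made no difference
ethtool -K eth0 tso on
```

Note the case distinction: lowercase -k queries the settings, uppercase -K changes them.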

sysadmin1138
  • Ah, Broadcom; so much pain and anguish over the years... *shakes fist at sky* – gravyface May 17 '11 at 02:48
  • sysadmin1138 - thanks, I didn't think about the drivers but will look into that tomorrow. I did want to get onto one of our boxes with both Broadcom and Intel Nics and run the same test to the same box with different NICs. WIll try that as well. I'll try checking out things with ethtool. thanks – Kyle Hailey May 17 '11 at 03:28
  • Yes, it turns out one of the NICs had downgraded to 100Mb instead of 1GbE! Thanks, ethtool and sysadmin1138. But now to move on to the real problem, which is a new AIX box we installed that shows the same issues while advertising and running at 1GbE. Will be checking for newer drivers for the NIC on the AIX box. – Kyle Hailey May 18 '11 at 04:45