Troubleshooting intermittent network "drops"

Question

We do most of our work on colocated servers in a datacenter over SSH, so we're connected to the boxes almost all day, five days a week. Intermittently, we see a lag between typing on the keyboard and having the characters echoed back to us in the shell. I started doing some digging, but I'm having trouble understanding the results, and I'm also looking for next steps. Earlier, I ran a Wireshark trace filtered on tcp.dstport == 22, which seems to be where we have the majority of the problems. I noticed a large-ish number of packets (10-20 out of several thousand) that were TCP retransmissions. I assume this is related to the lag issue we're seeing.

1) mtr to remote host

                                         Packets               Pings
 Host                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 192.168.100.254                    76.6%   454    0.5   0.5   0.3   4.7   0.4
 2. 10.113.128.1                       80.6%   454   17.3 130.8   5.7 6030. 726.7
 3. 74.128.19.209                      79.5%   454    9.7  25.8   6.7 1270. 133.2
 4. 74.128.8.233                       80.6%   454    8.5  31.9   6.6 1369. 150.6
 5. 4.71.250.1                         79.2%   454  1547.  50.5  14.7 1547. 194.1
 6. 4.69.138.158                       80.4%   454   20.1  29.7  15.4 1003. 104.5
 7. 4.69.140.189                       74.2%   454   16.2  28.6  15.0 920.0  85.5
 8. 4.69.138.4                         72.6%   454   17.0  41.2  15.5 821.6  81.7
 9. ???
10. 216.26.190.9                       79.4%   453   45.2 105.8  24.4 3008. 406.7
11. 216.26.162.162                     90.7%   453   28.3  40.2  24.1 556.3  81.7

2) mtr to 192.168.100.254 (happening simultaneously to above mtr)

                                         Packets               Pings
 Host                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 192.168.100.254                     0.0%   591    0.8   0.4   0.3   6.9   0.5

First question: why does the top mtr suggest packet loss at 192.168.100.254, when the bottom one does not?

Second question: how can I determine better what might be causing this?

EDIT:

mtr to first host outside our network:

                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. edge.networldalliance.local      18.1%   393    0.5   0.5   0.4   1.8   0.2
 2. 10.113.128.1                      0.0%   393   10.0  10.1   5.5 744.3  37.4

Separate mtr to the second host in the path:

                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. edge.networldalliance.local      87.9%   424    0.8   0.7   0.5   1.2   0.1
 2. 10.113.128.1                      0.0%   424    9.5   9.5   5.2 577.8  27.8
 3. 74-128-19-209.dhcp.insightbb.com  0.0%   423    6.5  10.4   6.2 243.9  12.8

Separate (again) mtr to the third host in the path:

                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. edge.networldalliance.local      87.2%   440    0.6   0.7   0.4   2.2   0.3
 2. 10.113.128.1                      0.0%   439    6.4  10.9   5.6 991.8  47.2
 3. 74-128-19-209.dhcp.insightbb.com  0.0%   439    8.5  13.3   6.5 744.3  35.6
 4. 74.128.8.233                      0.0%   439    7.9  23.6   6.3 493.8  47.2

Any suggestions based on this new data? I'm going to see about getting the router / firewall replaced.

Mike Pennington · Answer 1 · 2011-06-09T01:06:51.647

Direct Answers

First question: why does the top mtr suggest packet loss at 192.168.100.254, when the bottom one does not?

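mtr sends pings (ICMP echo requests) with an incrementing IP TTL until the destination itself answers. Each intermediate router is expected to return an ICMP TTL-exceeded message when the TTL runs out, while the final destination returns an ICMP echo reply. 192.168.100.254 behaves differently in the two cases: it answers TTL-expiration conditions poorly (low success) but answers echo requests addressed to itself reliably (high success).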

Second question: how can I determine better what might be causing this?

When you say "causing this", I assume you mean your laggy ssh sessions, instead of the weird mtr results... right? A couple of thoughts...

Run mtr directly to every host in the 11-hop path and see if you can find some interesting symptom starting at one of the hops; based on your first mtr, this may not be much more productive, but it's worth a shot. Also talk to the administrator of 192.168.100.254 to see if you guys can figure out why ICMP TTL-expired replies are getting hosed.
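When reading the per-hop mtr runs, one rough rule of thumb is that loss reported at an intermediate hop which does not carry through to the final hop usually points at that router deprioritizing its TTL-expired replies rather than dropping forwarded traffic. A minimal sketch of that check follows; the function names and the 5-point threshold are assumptions for illustration, and the sample rows are copied from one of the mtr runs in the question's EDIT.

```python
import re

# Matches mtr report rows such as " 2. 10.113.128.1   80.6%   454 ..."
# Rows without a loss figure (e.g. " 9. ???") are skipped.
ROW = re.compile(r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%")

def parse_mtr(text):
    """Return [(hop, host, loss_percent), ...] for rows that report loss."""
    hops = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return hops

def suspicious_hops(hops, threshold=5.0):
    """Intermediate hops whose loss exceeds the final hop's loss by more
    than `threshold` percentage points (threshold is an arbitrary choice)."""
    if not hops:
        return []
    final_loss = hops[-1][2]
    return [(hop, host, loss) for hop, host, loss in hops[:-1]
            if loss - final_loss > threshold]

# Rows taken from the question's EDIT section.
sample = """\
 1. edge.networldalliance.local      87.9%   424    0.8   0.7   0.5   1.2   0.1
 2. 10.113.128.1                      0.0%   424    9.5   9.5   5.2 577.8  27.8
 3. 74-128-19-209.dhcp.insightbb.com  0.0%   423    6.5  10.4   6.2 243.9  12.8
"""
print(suspicious_hops(parse_mtr(sample)))
```

On that sample only the first hop is flagged, which matches the idea that the edge router is slow to answer TTL-expired probes while still forwarding traffic to the hops behind it.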

Misc Thoughts

There are three general causes of network problems: packet loss, packet delay (queuing) or packet reordering. However, let's also remember that sometimes host-level issues contribute to your problem¹.
Let's assume for the moment that the 192.168.100.x vlan isn't where your problem is, and your topology looks like this:
```
    HOST_A----------------------HOST_B
    192.168.100.x               216.26.162.162
```

If you are not already ssh-ing from a Windows machine to HOST_A, do so². Now record your Windows desktop³. When the problem happens again, the recorded video is a very good audit trail for where your problems might be (i.e. in the network, on the hosts, or a combination of both). If you can somehow see NTP time in this video, all the better... this gives you a way to backtrack the analysis through syslog as well.

END-NOTES

¹ Is one of the hosts swapping to disk, consuming lots of CPU (perhaps caused by a script / DB query), or intermittently busy?
² Use at least four windows: one for ssh between HOST_A and HOST_B, another for a sniffing session on HOST_A, and the last two running top or vmstat 5 on HOST_A and HOST_B.
³ Use whatever you like, but I use Camstudio (the beta copy is my fav at the moment); it is free and open-source.
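If recording video is not practical, a similar audit trail can be approximated by timestamping the vmstat 5 output so it can be lined up with syslog afterwards. The following is only a sketch: the helper name `stamp` and the script name `stamp.py` in the usage note are made up, and it assumes the host clocks are already synchronized via NTP.

```python
import sys
from datetime import datetime, timezone

def stamp(line, now=None):
    """Prefix one line of output with a UTC timestamp (second resolution)."""
    now = now or datetime.now(timezone.utc)
    return f"{now.isoformat(timespec='seconds')} {line.rstrip()}"

# Only act as a filter when something is piped in, so running the file
# interactively does not sit waiting on the terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    for line in sys.stdin:
        print(stamp(line), flush=True)
```

It would be used as a filter, for example `vmstat 5 | python3 stamp.py >> vmstat.log` on both HOST_A and HOST_B.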

score 0 · Answer 2 · answered Jun 08 '11 at 21:29

To your second question: perhaps you can let ping run for a few hours against each of the hops you detected, redirecting the output to log files. Then extract the ping times with grep, awk, etc. and plot them (Excel, OO Calc, etc.). You should be able to see at which hop the lag starts.
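The extraction step could also be done in a few lines of Python instead of grep and awk. This is a rough sketch that assumes Linux iputils-style reply lines (`icmp_seq=N ... time=X ms`); the function names are arbitrary and the sample lines are made up for illustration, not taken from a real log.

```python
import re
import statistics

# Assumes Linux iputils-style reply lines, e.g.
#   64 bytes from 10.113.128.1: icmp_seq=7 ttl=254 time=9.52 ms
REPLY = re.compile(r"icmp_seq=(\d+).*?time[=<]([\d.]+)\s*ms")

def parse_ping_log(text):
    """Return [(icmp_seq, rtt_ms), ...] for every reply line in the log."""
    matches = (REPLY.search(line) for line in text.splitlines())
    return [(int(m.group(1)), float(m.group(2))) for m in matches if m]

def summarize(samples):
    """Count replies and gaps in icmp_seq (lost probes), plus mean/max RTT.
    Ignores icmp_seq wraparound, which matters for very long runs."""
    if not samples:
        return {"replies": 0, "missing": 0, "mean_ms": None, "max_ms": None}
    seqs = [seq for seq, _ in samples]
    rtts = [rtt for _, rtt in samples]
    return {
        "replies": len(rtts),
        "missing": (max(seqs) - min(seqs) + 1) - len(set(seqs)),
        "mean_ms": statistics.mean(rtts),
        "max_ms": max(rtts),
    }

def to_csv(samples):
    """Render samples as CSV text that a spreadsheet can plot."""
    return "icmp_seq,rtt_ms\n" + "".join(f"{seq},{rtt}\n" for seq, rtt in samples)

# Made-up sample lines, for illustration only.
sample_log = """\
64 bytes from 10.113.128.1: icmp_seq=1 ttl=254 time=9.50 ms
64 bytes from 10.113.128.1: icmp_seq=2 ttl=254 time=10.5 ms
64 bytes from 10.113.128.1: icmp_seq=4 ttl=254 time=577 ms
"""
samples = parse_ping_log(sample_log)
print(summarize(samples))
print(to_csv(samples), end="")
```

Running one summary per hop and comparing the mean, max, and missing counts side by side should show roughly where the latency spikes begin; the CSV output can go straight into Excel or OO Calc for plotting.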

What kind of Internet connection do you have? Upload saturation is often a suspect when you're dealing with high latency. Configure your router (or new router) to transmit at 85%-90% of the maximum connection speed, and set up fair queuing on it so that ssh packets don't end up at the back of the queue.
