Why am I getting long TCP connect latency in a LAN (even over a crossover cable!)?

Question

I am measuring about 100-150 milliseconds from sending a TCP SYN to receiving the SYN/ACK, between two Linux computers connected to the same Cisco switch. Consider:

The machines are very powerful, and neither they nor the switch is heavily loaded.
From analyzing tcpdump logs on the two machines, I see that the problem is not in the endpoints but in the network itself (the client sees a 100-150 ms delay, but the server responds in about 10 ms).
Only SYN requests are slow. Afterwards, normal TCP packets get an ACK right away.

So, my questions are:

Am I right to think this is way, way too much?
What latency should I aim for?
What can I do to further diagnose and solve the issue?

Edit - We've taken the switch out of the equation. The two computers are now connected directly with a crossover cable, and we're still seeing the problem. Both are on full duplex, 100 Mbps.
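A user-space cross-check of the capture could look like the minimal sketch below. It times one `connect()` call, which includes some kernel and interpreter overhead on top of the SYN to SYN/ACK round trip, so treat it as an upper bound rather than an exact wire measurement. The function names are made up for illustration, and the result is printed in both milliseconds and microseconds so the scale is hard to misread.

```python
import socket
import time


def measure_connect_seconds(host, port, timeout=2.0):
    """Return the wall-clock time taken by one TCP connect(), in seconds."""
    start = time.perf_counter()
    # create_connection() returns once the three-way handshake has completed.
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed


def format_latency(seconds):
    """Show the same value in both units so the scale cannot be misread."""
    return f"{seconds * 1e3:.3f} ms ({seconds * 1e6:.0f} us)"
```

Calling `format_latency(measure_connect_seconds(host, port))` against the other machine a few times in a row gives a quick sanity check on what the capture shows.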

score 3 · Answer 1 · answered Dec 08 '09 at 15:52

The usual suspects:

Duplex mismatch
- Check the switch for collisions or errors.
- Check the hosts for collisions or errors.
If you see collisions, that end is running half duplex and should be set to full. If you see errors, check the other end for collisions. If both ends show errors, you may have a bad cable.

DNS timeouts
- Log on to one host and look up the other host's IP with nslookup. You should get a name or an error very quickly.
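The nslookup check can also be timed from a script. The sketch below is only an illustration: `timed_lookup` is a made-up helper, and it goes through the system resolver via `socket.gethostbyaddr`, which may consult `/etc/hosts` before DNS depending on the host's configuration. A quick answer or a quick failure is fine; a lookup that takes seconds points at a DNS timeout.

```python
import socket
import time


def timed_lookup(ip, resolver=socket.gethostbyaddr):
    """Time one reverse lookup; return (name_or_None, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        name = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror both derive from OSError.
        name = None
    return name, time.perf_counter() - start
```

The `resolver` parameter exists only so the helper can be exercised without a live resolver; in normal use, call `timed_lookup("<peer IP>")`.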

edited Dec 08 '09 at 16:10

answered Dec 08 '09 at 15:52

chris

DNS timeouts are irrelevant, we're testing with IP addresses. Duplex settings were verified (see updated question). – ripper234 Dec 09 '09 at 11:04
The DNS problem that often causes this is that the listening side attempts a reverse lookup of the IP address establishing the connection. – chris Dec 09 '09 at 13:21

score 3 · Accepted Answer · answered Dec 09 '09 at 12:19

Well, crap. It appears I misread both the tcpdump and Wireshark logs. The delay I was getting was 100 microseconds, not milliseconds!
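For anyone wanting to avoid the same mistake, here is a small sketch that turns two tcpdump-style timestamps into an explicit delta in both units. It assumes the default `HH:MM:SS.ffffff` timestamp format and does not handle captures that cross midnight. The helper names and the sample timestamps are invented for illustration and are not taken from the actual capture.

```python
from datetime import datetime, timedelta


def delta_between(ts_first, ts_second):
    """Return the gap between two 'HH:MM:SS.ffffff' timestamps as a timedelta."""
    fmt = "%H:%M:%S.%f"
    return datetime.strptime(ts_second, fmt) - datetime.strptime(ts_first, fmt)


def describe(delta):
    """Render a timedelta in microseconds and milliseconds side by side."""
    micros = delta // timedelta(microseconds=1)  # integer microseconds
    return f"{micros} us = {micros / 1000:.3f} ms"


# Hypothetical SYN and SYN/ACK timestamps, 120 microseconds apart.
print(describe(delta_between("12:19:00.000100", "12:19:00.000220")))
```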


answered Dec 09 '09 at 12:19

ripper234

So that's fast, right? – chris Dec 09 '09 at 13:45
Very much so, indeed. – ripper234 Dec 09 '09 at 13:55
Back in the old days radar caused ships to sometimes crash into each other because the pilot would forget to reset the scale of the readout. "Don't worry, that ship is miles away." "Feet? oops." – chris Dec 09 '09 at 14:07
@chris - interesting. Have you got a source for that anecdote? – Cheeso Oct 24 '12 at 17:46

score 1 · Answer 3 · answered Dec 08 '09 at 16:12

Have you checked the cabling? Bad cables and/or punchdowns can result in retries that can greatly increase latency.

answered Dec 08 '09 at 16:12

Brian Knoblauch

Happens over all our environments, many different cables. – ripper234 Dec 09 '09 at 11:05

score 1 · Answer 4 · answered Dec 08 '09 at 22:11

What model of Cisco switch are you using? One possibility is that the switch doesn't know which port your server is on, in which case it has to flood the packet out of all ports, which could take time (though it shouldn't take 100 ms). You can verify this by running tcpdump on a third server that isn't one of the two you are testing. Once the server responds, the switch learns the port-to-MAC assignment and does the forwarding in ASIC. This could be especially prevalent on lower-end Cisco switches.

Also, do you have per-port ACLs? Those could also require CPU switching, which would be orders of magnitude slower than switching in ASIC. Do you see the same problem with pings, where the first ping has a 100 ms delay and subsequent pings take <1 ms? If it's a lower-end switch and you only see the delay on TCP/IP, I'd check that there isn't an ACL applied to TCP/IP packets.

I would also check the switch's CPU load. Even if usage is low, a stupid config that causes it to switch in CPU can easily overload it. We've overloaded high-end switches (10 Gbps backhaul) with traffic in the 100 Mbps range because we were inadvertently sending traffic that had to be switched in the CPU.

Removed the switch from the equation (used a crossover cable), and it still happens. – ripper234 Dec 09 '09 at 11:05
That is incredibly weird. What do your ping results look like? Also, I don't know if it helps, but I once encountered a problem on FreeBSD where a kernel operation was freezing the entire system for a few seconds. We noticed this because we were running VRRP and every day or two the VIP would fail over. We finally matched it to a kernel log entry that had something to do with memory. At this point I would make sure you're not using cheap NICs, that you have the latest drivers, and maybe look for related kernel problems or logs, since a freeze like that would also affect your tcpdump capture and make it look normal. – Kevin Nisbet Dec 09 '09 at 15:10

score 0 · Answer 5 · answered Dec 08 '09 at 15:53

This seems like the latency you would get going from one side of the US to the other. Is the switch managed? Can you connect to the switch and check for issues? I would expect less than 1-2 ms on a local network.

answered Dec 08 '09 at 15:53

Dave M

score 0 · Answer 6 · answered Dec 08 '09 at 15:58

In my experience Cisco switches should add less than 1 ms of latency, so yes, this is an indication of a problem.

Are both devices connected to the switch via wires (i.e. not 802.11)? In the same VLAN?

Is this a trusted network? If the devices and switches are lightly loaded I would be concerned that someone was using an ARP hijack to insert themselves in the traffic flow as a man-in-the-middle...

If you check the ARP table on these boxes (arp -an) and compare the entry for the other box's IP address against the output of ifconfig on that box, do the MAC addresses match?
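If you want to script that comparison, something along these lines might do. It is a rough sketch that assumes the Linux net-tools output format for `arp -an` (lines like `? (IP) at MAC [ether] on IFACE`); the helper names are made up, and you would paste in or capture the real command output yourself.

```python
import re

# Matches the "(IP) at MAC" part of a Linux-style `arp -an` line.
ARP_LINE = re.compile(r"\((?P<ip>[\d.]+)\) at (?P<mac>[0-9a-fA-F:]{17})")


def parse_arp_table(text):
    """Map IP -> lower-cased MAC; incomplete entries are skipped."""
    return {m["ip"]: m["mac"].lower() for m in ARP_LINE.finditer(text)}


def mac_matches(arp_text, peer_ip, peer_mac):
    """True if the ARP entry for peer_ip equals the MAC the peer itself reports."""
    return parse_arp_table(arp_text).get(peer_ip) == peer_mac.lower()
```

A mismatch means the box is sending traffic for that IP to some other interface, which is worth investigating before anything else.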

You mention that you are analysing tcpdump output. Are you comparing the timestamps between the two boxes? If so, are you sure that the clocks are in sync?

Do you have access to a third host on the network to compare performance to the other two boxes?

answered Dec 08 '09 at 15:58

Russell Heilling

I'm comparing the time differences between the two computers, not the absolute times. – ripper234 Dec 08 '09 at 16:15
The problem with comparing time deltas is that it's impossible to tell whether there is a big delay in one direction, or whether there is a smaller delay in both directions :( – Russell Heilling Dec 08 '09 at 16:45
They are both connected to the same VLAN, wires, I have no reason to suspect an ARP hijack. I am experiencing the problem from all devices to all devices in the network. Does this mean it's a bad switch configuration? Note, the problem is only felt at the TCP connect phase. Other packets get an ACK quickly. – ripper234 Dec 08 '09 at 20:45
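To put the point about time deltas in concrete terms: if each subtraction only uses timestamps taken on one host, the clock offset between the hosts cancels out, but what you get is the combined time on the wire for both directions, not a per-direction figure. A minimal sketch, with a hypothetical function name and timestamps given as seconds on each host's own clock:

```python
def network_round_trip(client_syn_sent, client_synack_seen,
                       server_syn_seen, server_synack_sent):
    """Combined wire time for SYN plus SYN/ACK, independent of clock offset."""
    # Measured entirely on the client's clock.
    client_rtt = client_synack_seen - client_syn_sent
    # Measured entirely on the server's clock.
    server_turnaround = server_synack_sent - server_syn_seen
    # What remains was spent in the network, in both directions together.
    return client_rtt - server_turnaround
```

Splitting that total into outbound and return delay would need synchronized clocks, which is exactly the limitation described above.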
