
First a little background: On the (isolated) /16 LAN in question, we have several devices that keep several persistent TCP connections open between them. The program at each end of these TCP connections sends a "heartbeat" packet to its partner once every two seconds; and also each program keeps track of when it last received a heartbeat: if it hasn't received a heartbeat packet for four seconds, it figures something is wrong, closes the TCP connection, reports a problem to the user, and then tries to re-establish the connection.
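To make the timing concrete, here is a minimal sketch of the heartbeat bookkeeping described above. It is not the actual program: the class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are all made up for illustration, and only the 2-second and 4-second figures come from the description.

```python
import time

SEND_INTERVAL = 2.0  # seconds between outgoing heartbeats (from the description)
TIMEOUT = 4.0        # seconds of silence before the peer is considered gone


class HeartbeatMonitor:
    """Hypothetical helper: tracks when to send and when the peer went silent."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the logic can be exercised without sleeping.
        self._clock = clock
        now = clock()
        self._last_sent = now
        self._last_received = now

    def on_heartbeat_received(self):
        # Call whenever a heartbeat arrives from the partner.
        self._last_received = self._clock()

    def should_send(self):
        # True once SEND_INTERVAL has elapsed since our last heartbeat.
        return self._clock() - self._last_sent >= SEND_INTERVAL

    def mark_sent(self):
        # Call right after writing a heartbeat to the socket.
        self._last_sent = self._clock()

    def peer_timed_out(self):
        # True once nothing has been received for TIMEOUT seconds.
        return self._clock() - self._last_received >= TIMEOUT
```

Note that with these numbers a single heartbeat that arrives more than two seconds late is already enough to trip the timeout.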

Also on this LAN is a Linux box that runs the following command periodically:

/usr/bin/arp-scan --interface=bond0:2 --localnet --bandwidth=2560

It does this to find out if there are any duplicate IPv4 addresses on the LAN; if so, it reports the problem to the user.
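For illustration, the duplicate check could look roughly like the sketch below. This is an assumption about how the output is processed, not the actual script: `find_duplicate_ips` is a hypothetical helper, and it assumes arp-scan's usual tab-separated `IP<TAB>MAC<TAB>vendor` result lines, skipping anything that does not start with an IPv4 address.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_ips(scan_output):
    """Return {ip: [mac, ...]} for every IP answered by more than one MAC.

    Assumes tab-separated result lines (IP, MAC, vendor); header and
    footer lines are ignored because they do not parse as IPv4 addresses.
    """
    macs_by_ip = defaultdict(set)
    for line in scan_output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # blank, header, or summary line
        try:
            ip = ipaddress.IPv4Address(fields[0].strip())
        except ValueError:
            continue  # first field is not an IPv4 address
        macs_by_ip[str(ip)].add(fields[1].strip().lower())
    # An IP claimed by two or more distinct MACs is a duplicate address.
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

The input would be the captured standard output of the arp-scan command above.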

This is all fine, except that occasionally (e.g. once every few days) we get heartbeat-timeouts for no obvious reason, and there has been some speculation that the arp-scan may be interfering with the TCP traffic such that the heartbeats are getting held off long enough to trigger the 4 second timeout. These events often happen at night, when the LAN is more or less idle (except for the heartbeat packets and the arp-scan, of course). When these events occur, the TCP connection is always immediately and successfully re-established, but the resulting error messages are making the users nervous, so I'd like to figure out what is going on here.

My question is: is arp-scan's scanning mechanism intrusive enough that it might plausibly be the culprit here? Note that we supply a --bandwidth=2560 parameter so that it shouldn't use up a significant amount of bandwidth during the scan; but perhaps the arp packets cause the arp<->IP address caches to be flushed, or something like that?
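As a back-of-the-envelope estimate of how hard the scan hits the wire, the arithmetic can be sketched as follows. The 64-byte frame size and the single pass over the address range are my assumptions (arp-scan may count packet size differently and retries unanswered addresses), and `scan_profile` is just an illustrative name.

```python
def scan_profile(prefix_len, bandwidth_bps, frame_bytes=64, passes=1):
    """Estimate packets per second and total scan time for a rate-limited scan.

    frame_bytes and passes are assumptions, not values taken from arp-scan.
    """
    hosts = 2 ** (32 - prefix_len)          # addresses in the prefix
    bits_per_frame = frame_bytes * 8
    packets_per_second = bandwidth_bps / bits_per_frame
    duration_seconds = hosts * passes / packets_per_second
    return packets_per_second, duration_seconds


pps, seconds = scan_profile(prefix_len=16, bandwidth_bps=2560)
print(f"{pps:.1f} requests/s, about {seconds / 3600:.1f} hours per pass")
```

Under those assumptions a /16 at 2560 bits per second works out to about five ARP requests per second, spread over a few hours per pass.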

Jeremy Friesner

2 Answers


Personally, I would stop running the arp-scan automatically and instead run it manually a few times during the day. Give it a couple of weeks and see whether the arp-scan really is causing your issues, because I'd bet that it is completely unrelated.

I'd also start tcpdumping both sides so you can see which packets actually get sent/received.

But, really, a TCP connection is never going to last indefinitely. As long as your app "always" is able to re-create the connection, why do you alert the user? Why not just silently re-create the connection, and only throw an error if the re-creation fails or you detect you're creating more than X connections per hour/day?
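A rough sketch of that "only complain if it happens too often" idea, assuming a sliding time window. The class name `ReconnectTracker`, its parameters, and the thresholds in the usage line are hypothetical, so pick whatever X and window suit your users.

```python
import time
from collections import deque


class ReconnectTracker:
    """Hypothetical helper: counts reconnects inside a sliding time window."""

    def __init__(self, max_reconnects, window_seconds, clock=time.monotonic):
        self._max = max_reconnects
        self._window = window_seconds
        self._clock = clock          # injectable so it can be tested without waiting
        self._events = deque()       # timestamps of recent reconnects

    def record_reconnect(self):
        """Record one successful reconnect; return True if the user should be alerted."""
        now = self._clock()
        self._events.append(now)
        # Forget reconnects that have fallen out of the window.
        while self._events and now - self._events[0] > self._window:
            self._events.popleft()
        return len(self._events) > self._max


# Example: stay quiet unless there are more than 3 reconnects within an hour.
tracker = ReconnectTracker(max_reconnects=3, window_seconds=3600)
```

A failed reconnect attempt would still be reported immediately; this only covers the case where the connection comes straight back.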

opsguy

arp-scan just sends arp-who-has requests to the broadcast address - that's what is happening on the network all the time anyway, so there would be no reason for it to disturb any connections.

Even if a host's ARP cache overflows, the host will just issue an arp-who-has request of its own before sending an IP packet. That delays the packet by about one round-trip time, which in LAN environments is three orders of magnitude lower than your timeout value and thus negligible.

TCP is not the best protocol to use with very frequent heartbeats: every segment lost on the link delays delivery of its data by at least one retransmission timeout, and that timeout doubles with each further loss. RFC 6298 recommends a minimum retransmission timeout of one second; some stacks use a lower floor, in which case more consecutive losses are needed. If losses are unfortunate enough to happen 2-3 times in a row on a certain link, you will get your application timeouts.
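The arithmetic behind that can be sketched as below. It is a simplified model under stated assumptions: plain exponential backoff starting from a fixed minimum RTO, no fast retransmit, negligible network latency, and the 2-second interval and 4-second timeout from the question. The function names are made up for illustration.

```python
def delivery_delay(consecutive_losses, min_rto=1.0):
    """Extra seconds before a segment gets through if it is lost this many
    times in a row, with the retransmission timeout doubling after each loss."""
    delay = 0.0
    rto = min_rto
    for _ in range(consecutive_losses):
        delay += rto   # wait one timeout, then retransmit
        rto *= 2       # exponential backoff
    return delay


def losses_needed_for_timeout(heartbeat_interval=2.0, timeout=4.0, min_rto=1.0):
    """Smallest number of consecutive losses of one heartbeat that pushes its
    arrival to or past the application timeout (measured from the previous
    heartbeat, which is assumed to have arrived on time)."""
    losses = 0
    while heartbeat_interval + delivery_delay(losses, min_rto) < timeout:
        losses += 1
    return losses


print(losses_needed_for_timeout(min_rto=1.0))
print(losses_needed_for_timeout(min_rto=0.2))
```

In this model the one-second floor needs two consecutive losses of the same heartbeat, while a 200 ms floor (an assumed example of a lower value) needs four.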

Another possible explanation could be the load on the host sending the heartbeats: if it is running high-priority jobs that saturate the CPU, your heartbeat-generating threads may suffer from short-term starvation and not get the heartbeat out in time.

So to pinpoint the problem, I'd check the data link layer counters for errors or possible flow control influence and the performance counters of your heartbeat-generating server for possible CPU or memory bottlenecks at night. If you don't find anything suspicious, just increase the timeout :)
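On Linux, one way to watch the link-layer error counters is to snapshot the per-interface statistics files under `/sys/class/net` before and after a night and compare them. The sketch below assumes that sysfs layout; the helper names and the chosen counter list are mine, and flow-control (pause frame) counters are driver-specific and not covered here.

```python
from pathlib import Path

# Standard per-interface counters exposed by Linux under .../statistics/
ERROR_COUNTERS = (
    "rx_errors", "tx_errors", "rx_dropped",
    "tx_dropped", "rx_crc_errors", "collisions",
)


def read_link_counters(interface, sysfs_root="/sys/class/net"):
    """Return {counter_name: value} for the counters that exist on this host."""
    stats_dir = Path(sysfs_root) / interface / "statistics"
    counters = {}
    for name in ERROR_COUNTERS:
        path = stats_dir / name
        if path.is_file():
            counters[name] = int(path.read_text().strip())
    return counters


def counter_deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Take one snapshot in the evening with `read_link_counters("bond0")` (the interface name is a guess based on the command in the question), another after a heartbeat timeout has been reported, and pass both to `counter_deltas`.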

the-wabbit