Connections to unreachable hosts are taking an extremely long time to time out. Recent testing in our lab suggests the cause may be a delay in reporting negative ARP lookups. Capturing traffic during attempts to open a telnet connection to a local zone that was down for patching showed the following.

When the source was a Linux host, three ARP requests were sent at 1-second intervals, and the connection failed in just over three seconds.

When the source was a Solaris server, an initial five ARP requests were sent to the broadcast address at 1-second intervals. Five seconds later, more ARP requests were sent. ARP requests continued with increasing pauses until the connection failed after 3 minutes and 44 seconds. Tests were run from a global zone to a local zone on a different global zone. Both global zones are running on SPARC hardware. The devices are connected via layer 2 switching equipment.
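To compare the two retry patterns, the pauses between ARP requests can be computed from the capture timestamps. This is only a minimal sketch: `request_gaps` is a hypothetical helper, and the send times below are idealised values taken from the description above, not copied from a real capture.

```python
def request_gaps(timestamps):
    """Return the pause, in seconds, between consecutive ARP requests."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


# Idealised send times in seconds, based on the description above
# (assumption: not actual capture data).
linux_times = [0, 1, 2]            # three requests, 1 second apart
solaris_times = [0, 1, 2, 3, 4, 9]  # five requests, then the next one 5 seconds later

print(request_gaps(linux_times))    # [1, 1]
print(request_gaps(solaris_times))  # [1, 1, 1, 1, 5]
```

Feeding in the real timestamps from the capture would show how quickly the Solaris pauses grow over the 3 minutes and 44 seconds.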

Are there any tunables which will result in a fast (3 to 5 seconds) ARP failure? Are there any other tunables which will cause connections to unreachable (downed) hosts to fail faster?
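For reference, the time a connection attempt takes to fail can be measured with something like the sketch below. It also shows a client-side timeout, which caps the wait in the application but does not change how the kernel handles ARP, so it is a workaround rather than the tunable asked about. `timed_connect` is a hypothetical helper name and the 5-second default is an assumption.

```python
import socket
import time


def timed_connect(host, port, timeout=5.0):
    """Try a TCP connect; return (succeeded, seconds_elapsed, error_or_None)."""
    start = time.monotonic()
    try:
        # create_connection gives up after `timeout` seconds, regardless of
        # how long the kernel keeps retrying ARP underneath.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:  # covers timeouts, refusals, and unreachable hosts
        return False, time.monotonic() - start, exc
```

Calling `timed_connect(host, 23)` against the downed zone's address should return within roughly the chosen timeout, and the elapsed value can be compared between the Linux and Solaris sources.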

We see the same or similar behavior across a variety of servers running on SPARC. As far as I can tell, Solaris tries very hard to resolve the address by ARPing for it, and does not time out quickly when no host replies to the ARP requests.

BillThor
  • If you snoop on another machine, do you actually see the arp-whois messages on the Ethernet Broadcast address? Also what I don't understand: You tried to connect from a local zone to another zone on the machine, did I get this correctly? Also, which architecture? SPARC or Intel? – Alexander Janssen Oct 05 '12 at 16:10
  • @AlexanderJanssen I have updated the post with additional information. – BillThor Oct 09 '12 at 00:31
  • So SPARC then, eh? Is `local-mac-address` set to true? If it's not, the same MAC address from a multiport Ethernet card might be visible on more than just one port. This can confuse the switch. Refer to http://docs.oracle.com/cd/E19963-01/html/821-1458/geyqe.html – Alexander Janssen Oct 09 '12 at 19:30
  • @AlexanderJanssen The server is down, so no ARP responses are expected nor generated. Solaris keeps ARPing for several minutes before reporting the server is unreachable. – BillThor Oct 09 '12 at 23:06

1 Answer

Did you consider running `ndd /dev/arp \?` to see a list of ARP-related kernel configurables?

pfo
  • Yes, we had checked all the values and did not find any that would explain the behavior. Based on the `arp_probe_count` value of 3, we would expect behavior similar to what we see with other operating systems. – BillThor Oct 09 '12 at 23:19