
I have a server with network connectivity issues that I presume stem from problems in ARP handling.

Let's say the network topology is as follows:

  • network 192.168.106.0, netmask 255.255.255.0
  • router at 192.168.106.1
  • "problem server" at 192.168.106.2
  • another server at 192.168.106.3

Now, assume that the "problem server" may be silent on the network for periods long enough for its arp entry on the router to expire.

When someone from outside this network attempts to connect to the "problem server", all attempts time out. Connections from within the network to the "problem server" succeed.

If the "problem server" itself attempts to connect to some other address outside the network, the connection succeeds -- and after this, also connections from outside the network to the "problem server" succeed for a while. Also, connections from the "problem server" to "another server" are ok.

Looking at ARP traffic after the "problem server" has been silent for a long time, I can see ARP requests on the network for the "problem server" address, but the "tell" address in them is the network address (192.168.106.0) instead of the router address (192.168.106.1). I assume this is the cause of the problem: for some reason the router puts the wrong sender address in its ARP requests.
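To make the observation concrete, here is a minimal Python sketch of the check I am doing by eye: it decodes the 28-byte ARP payload (Ethernet/IPv4 layout) and flags requests whose sender ("tell") address equals the subnet's network address. The function names and the MAC address `02:00:00:00:00:01` are made up for illustration; only the 192.168.106.x addresses come from the description above.

```python
import ipaddress
import struct

# ARP payload for Ethernet/IPv4: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes in total).
ARP_FMT = "!HHBBH6s4s6s4s"
ARP_REQUEST = 1


def build_request(sender_mac, sender_ip, target_ip):
    """Build an ARP request payload (helper for the example only)."""
    return struct.pack(
        ARP_FMT,
        1, 0x0800, 6, 4, ARP_REQUEST,
        bytes.fromhex(sender_mac.replace(":", "")),
        ipaddress.IPv4Address(sender_ip).packed,
        b"\x00" * 6,  # target MAC is unknown in a request
        ipaddress.IPv4Address(target_ip).packed,
    )


def parse_arp(payload):
    """Decode the fields of interest from a raw ARP payload."""
    _, _, _, _, oper, sha, spa, _, tpa = struct.unpack(ARP_FMT, payload[:28])
    return {
        "oper": oper,
        "sender_mac": sha.hex(":"),
        "sender_ip": ipaddress.IPv4Address(spa),
        "target_ip": ipaddress.IPv4Address(tpa),
    }


def sender_is_network_address(arp, network):
    """True if this is a request whose 'tell' address is the network address."""
    net = ipaddress.ip_network(network)
    return arp["oper"] == ARP_REQUEST and arp["sender_ip"] == net.network_address


# The request I see on the wire (MAC is a placeholder):
bad = parse_arp(build_request("02:00:00:00:00:01", "192.168.106.0", "192.168.106.2"))
# The request I would expect from the router:
good = parse_arp(build_request("02:00:00:00:00:01", "192.168.106.1", "192.168.106.2"))
```

With these two samples, `sender_is_network_address(bad, "192.168.106.0/24")` is true and the same call on `good` is false.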

The "another server" remains reachable, but there I assume the reason to be that it frequently makes connections to outside the local network, and thus keeps its arp entry at the router from expiring.

Any comments / suggestions?

The servers in question are running Linux (CentOS 5.x?), and are running as VMs within VMware ESXi (5.0?) (I'll check/fill in version details once I get back to work on Monday). The router make/model is unknown to me.

Responses to questions, further findings

Apologies for being slow to return to this.

Unfortunately my visibility into the network side (anything beyond the VMware platform itself) is severely limited.

Based on the arp request packets from the router, it is a Juniper product (guessing by requestor MAC address).

This is a small network, so consider topology as a router, switch, and a single VMWare server hosting several virtual machines.

As for the originator of the odd ARP requests, it pretty much has to be the network gateway: they only appear when I try to connect to the "problem" machine from outside the network, and cease when the attempt times out or is cancelled. A minor oddity is that the MAC address in these requests is not the same as the one seen for the router in the server's ARP table after establishing an outbound connection. However, both the MAC address present in these "odd" requests and the MAC address shown in the server's ARP table have a Juniper-assigned OUI.

One possibly related finding: it seems that Linux won't respond to ARP requests where the "tell" address is the network address, whereas Windows (Vista at least) does. I wasn't able to test this in the actual problem environment, only with my own toys at home.

Also, it looks like I'm not completely alone with this issue; a similar experience can be found here: alpacapowered.wordpress.com

Juha Laiho
  • What kind of router, and what does its config look like? – Shane Madden Aug 31 '12 at 16:53
  • Please post a description of the topology of the net. Sounds to me like some VMware related ARP issue. Look into the KB at VMware, e.g. at this: http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&externalId=1005965 – jpe Aug 31 '12 at 16:55
  • Yeah, need to know router/see config. That doesn't sound right though. What do other machines' ARP conversations with the router look like? – gravyface Aug 31 '12 at 17:02
  • You should be able to work around the problem by adding a `ping -c3 192.168.106.1` command to cron on your servers to be run every 10 minutes or so, this way you ensure that the arp entries won't time out. But this will not get to the cause of the problem. – jpe Sep 01 '12 at 09:21
  • Which device is producing the tell requests with a source of 192.168.106.0? I'm guessing it's the router. Track it down by MAC address and check for problems like out of memory or erroneous ACLs (or just plain bad software). – Paul Gear Sep 04 '12 at 21:58
  • In general it is OK to have the network address for the "tell" address, 192.168.106.1 will receive the packet as well if it gets sent to 192.168.106.0. `arping` is a good tool to try from some other machine to see if the router behaves in a wrong way or something else. But this thread seems kind of dead, as there are no responses to additional questions from the OP. – jpe Sep 05 '12 at 08:03
  • Can you add a static ARP entry to the router? Also, even if the router has dropped the ARP entry from its table, that shouldn't cause an incoming connection to drop unless it takes an unusually long time for the "problem server" to respond to the ARP request; unless you have packet loss, it should respond to the first request. Can you run a packet capture on the "problem server" and see where it is sending its ARP replies (to what MAC address)? – jwbensley Sep 05 '12 at 20:51
  • Check the source ether address of the arp that says `tell 192.168.106.0` (use `tcpdump -pnn -e arp`). Check that against your arp table (`arp -na | grep `). That will let you know whether or not the "problem" arps are coming from the router. – bahamat Sep 05 '12 at 21:06
  • The best way to approach your problem would be to sling the Juniper device as far as you can away from anything that you need to be up and running for extended periods of time (use some fireworks rockets for better distance). Or if some of the above hacks work, you are lucky. For now. – jpe Sep 05 '12 at 21:22
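
The MAC address check suggested in the comments can be scripted. Below is a rough Python sketch that groups the source MAC addresses seen in `tcpdump -pnn -e arp` output by the "tell" address they used. The sample lines are illustrative rather than captured from the real network, the MAC addresses are placeholders, and the regular expression assumes tcpdump's usual one-line ARP format, which may differ between versions.

```python
import re

# Matches the link-level source MAC printed by "tcpdump -e" and the
# "who-has <target> tell <sender>" part of an ARP request line.
LINE_RE = re.compile(
    r"(?P<src_mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) > .*?"
    r"who-has (?P<target>\d+\.\d+\.\d+\.\d+) "
    r"tell (?P<tell>\d+\.\d+\.\d+\.\d+)",
    re.IGNORECASE,
)


def requesters_by_tell(lines):
    """Map each 'tell' address to the set of source MACs that used it."""
    seen = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            seen.setdefault(match["tell"], set()).add(match["src_mac"].lower())
    return seen


# Illustrative lines only; the MACs are hypothetical.
SAMPLE = [
    "10:15:01.123456 02:00:00:00:00:01 > ff:ff:ff:ff:ff:ff, ethertype ARP "
    "(0x0806), length 60: Request who-has 192.168.106.2 tell 192.168.106.0, length 46",
    "10:15:07.654321 02:00:00:00:00:02 > ff:ff:ff:ff:ff:ff, ethertype ARP "
    "(0x0806), length 60: Request who-has 192.168.106.3 tell 192.168.106.1, length 46",
]

result = requesters_by_tell(SAMPLE)
```

If the MAC listed under `192.168.106.0` shares a vendor prefix with the one listed under `192.168.106.1`, that points at the router (or a second member of it) as the source of the odd requests.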

2 Answers


Today brought an interesting change of situation.

Eventually, things boiled down to two things:

The Juniper router, or actually a clustered firewall system, had somehow lost its configuration synchronisation between the cluster members. As a result, not all parts of the FW cluster had an up-to-date configuration, and this resulted in the ARP requests being wrong (yes, the bad ARP requests did originate from the router/firewall).

The management application for the firewall was also misbehaving, trying to push something other than the current, correct configuration to at least part of the firewall cluster.

I don't have the details on what was done for the firewall itself, nor for the management application, but the end result is that now the "tell" address on the arp requests is the router IP address (.1 from the original description), instead of the network address (.0).

And to these ("who-has ... tell ... .1") ARP requests the Linux server responds just as it should, and the inbound connections work just dandy, even long after any trace of the server address has been lost from the router's ARP cache.

Juha Laiho

I ran into the exact same issue. Turned out that someone had set the manage-ip value to the subnet address:

    Cluster:name(M)-> get config | inc aggregate10.200
    set interface aggregate10.200 ip x.x.x.x.225/28
    ...
    set interface aggregate10.200 manage-ip x.x.x.224
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To fix:

    unset interface aggregate10.200 manage-ip

This was a misconfiguration in our case.
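
For anyone wanting to check their own config for the same mistake, here is a small Python sketch that tests whether an address is the subnet (network) address of an interface. The 192.0.2.x values are documentation addresses standing in for the redacted ones above, and the function name is my own; the last octets (.225/28 and .224) mirror the config excerpt.

```python
import ipaddress


def is_subnet_address(candidate, interface):
    """True if candidate is the network address of the interface's subnet.

    interface is given in address/prefix form, e.g. "192.0.2.225/28".
    """
    iface = ipaddress.ip_interface(interface)
    return ipaddress.ip_address(candidate) == iface.network.network_address


# A /28 containing .225 spans .224 to .239, so .224 is the subnet address.
manage_ip_is_bad = is_subnet_address("192.0.2.224", "192.0.2.225/28")
interface_ip_is_bad = is_subnet_address("192.0.2.225", "192.0.2.225/28")
```

Here `manage_ip_is_bad` is true and `interface_ip_is_bad` is false. The same check applied to the question's numbers, `is_subnet_address("192.168.106.0", "192.168.106.1/24")`, is also true.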

Mike