3

I'm just guessing that arp is my problem...

I have a Linux DRBD server cluster set up, and due to some power issues I had to unplug the switch that connects the two servers. As a result, both servers became primary and took the same IP address for several seconds. (This caused a split-brain condition, but that's another issue.)

My problem is that now some servers seem to be able to see the shared IP address of the cluster, and some cannot. I am wondering if this could be a situation where some switches/ports are sending the traffic to one server, and some to the other?

And if this IS the problem, how can I resolve it?

  • and... is this done at the switch, or on the server?
Brent

3 Answers

8

If it's really an ARP issue, the problem will be confined to the network device doing the routing (since that is what ARP is for: mapping L3 addresses (IP) to L2 addresses (MAC)), or possibly to the ARP cache of a server sitting in the same IP subnet. It won't involve a switch unless it's an L3 switch.

To address the problem on a cisco router, you can run the following command to clear the arp cache and allow it to rebuild:

clear arp

To remove the bad ARP entry from a server that is caching stale information (that is, not the server that can't be reached, but the server that can't do the reaching), manually delete the bogus entry from its ARP cache, where <ip-address> is the IP of the server that can't be reached. Note that this same syntax appears to be valid on both Linux and Windows:

arp -d <ip-address>
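
Before deleting anything, it can help to confirm which MAC address a client has actually cached for the shared IP. The following is a minimal Python sketch, assuming a Linux client whose ARP cache is readable at /proc/net/arp; the function names, the sample table, and the addresses in it are made up for illustration:

```python
def parse_arp_table(text):
    """Map each IP in /proc/net/arp-style text to its cached MAC address."""
    entries = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, flags, mac = fields[0], fields[2], fields[3]
        # Flag bit 0x2 (ATF_COM) marks a completed entry; skip incomplete ones.
        if int(flags, 16) & 0x2:
            entries[ip] = mac.lower()
    return entries


def cached_mac(text, ip):
    """Return the MAC cached for ip, or None if there is no complete entry."""
    return parse_arp_table(text).get(ip)


def is_stale(text, ip, expected_mac):
    """True if a MAC is cached for ip and it differs from expected_mac."""
    mac = cached_mac(text, ip)
    return mac is not None and mac != expected_mac.lower()


# Hypothetical table: 10.0.0.50 stands in for the shared cluster IP.
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
10.0.0.50        0x1         0x2         00:16:3E:AA:BB:01     *        eth0
10.0.0.1         0x1         0x2         00:16:3e:00:00:fe     *        eth0
10.0.0.77        0x1         0x0         00:00:00:00:00:00     *        eth0
"""

if __name__ == "__main__":
    # On a real client, read the live table instead of SAMPLE.
    with open("/proc/net/arp") as fh:
        print(parse_arp_table(fh.read()))
```

If a client that can't reach the cluster shows the MAC of the node that is no longer primary, that client is a candidate for the arp -d step.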

You can also send a gratuitous ARP from the server which can't be reached to cause other hosts on the same IP subnet to update their ARP caches (I have this in my notes, but I admit I haven't used it in a long time. I can't remember if this allows you to skip the steps above, or just shortens the process of the other hosts adding an arp entry after running the commands above):

arping -q -A -c 1 -I eth0 <ip-address>
arping -q -U -c 1 -I eth0 <ip-address>
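
To show what a gratuitous ARP looks like on the wire, here is a rough Python sketch that only builds the frame bytes. The function names and example addresses are hypothetical, and the target hardware field is zero-filled as a simplification, so this is not claimed to match arping's output byte for byte:

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def gratuitous_arp_frame(mac, ip, reply=False):
    """Build a broadcast Ethernet frame announcing that ip is at mac.

    reply=False gives an ARP request (like arping -U),
    reply=True gives an ARP reply (like arping -A).
    """
    sha = mac_to_bytes(mac)            # sender hardware address
    spa = socket.inet_aton(ip)         # sender protocol (IP) address
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    ether = broadcast + sha + struct.pack("!H", 0x0806)
    oper = 2 if reply else 1
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, operation
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper)
    tha = b"\x00" * 6                  # assumption: zero-filled target MAC
    # Sender and target IP are the same address: that is what makes it gratuitous.
    arp += sha + spa + tha + spa
    return ether + arp


# Hypothetical values for the cluster's shared IP and the active node's MAC.
frame = gratuitous_arp_frame("00:16:3e:aa:bb:01", "10.0.0.50")
```

Actually sending the frame would need a raw socket and root privileges, which is why arping is the more practical tool here.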

All of the above is for an ARP issue, but you specifically mention a switch in your question. If it's a switch that only uses L3 for management, then the data flow problems would have to be problems with the MAC cache, not the ARP cache. In that case, you could run the following on the switch to purge the dynamic cache contents:

clear mac-address-table dynamic
jj33
  • Does this just flush the arp entry on the server, or on the switch/router as well? Do I need to log on to the switch/router to clear the entry there? – Brent Jun 12 '09 at 18:12
  • fleshed out the answer in response to your comment, hope it's more clear now. – jj33 Jun 12 '09 at 18:29
1

You can use the arp command in Linux to delete a particular entry with the -d switch. If you have managed switches, you can probably clear the ARP cache on them as well; with Cisco it would just be clear arp. Other than that, you can of course always just power-cycle all the switches and they should rebuild their tables.

Kyle Brandt
0

Were the switches also powered off during this power outage? Maybe they lost the last configuration change, the one that says "for this MAC address, send packets to these 2 ports".

Cisco switches have to be set up like a hub for the virtual MAC, so that they send all packets addressed to the virtual MAC to both hosts.

Mathieu Chateau
  • It is not true that the switch needs to be set up to send packets to both mac addresses. Heartbeat manages updating the arp entries automatically. – Brent Jun 12 '09 at 18:15
  • It depends on the setup. It applies when there is one virtual MAC with both hosts active at the same time. I don't know if it applies to this case, but for NLB, see this Cisco documentation for NLB in multicast mode: http://www.cisco.com/en/US/products/hw/switches/ps708/products_configuration_example09186a0080a07203.shtml Excerpt: mac-address-table static 0300.5e11.1111 vlan 200 interface fa2/3 fa2/4 – Mathieu Chateau Jun 12 '09 at 19:30