Bond0 Failover when link light stays up: High Availability Network

Question

Configuration:

Two switches, each with a separate internet route
CentOS servers with eth0 and eth1 bonded as active-backup on bond0, eth0 in one switch and eth1 in the other
/etc/modprobe.conf configured so, for bond0:

alias bond0 bonding

options bond0 mode=1 primary=eth0 miimon=100
eth0 was sometimes plugged in to the primary switch, sometimes the secondary.

Scenario:

Secondary switch has memory failure
Link lights stay up, but switch is no longer handling traffic

Because we used miimon, which only checks link status, none of our servers removed that link from their bond when the switch failed. This caused network outages, and the servers where eth0 was plugged into that secondary switch became entirely unavailable. Ironically, this is worse than if someone had just gone through and yanked all the cables out, since in that case the bonds would have failed over.

I have been testing arp_interval as an alternative, but as I understand it, arp_interval has two limitations:

arp_ip_target only takes one IP, meaning if that IP address goes down, bond0 will mistakenly think the link should be down, and take it down. I was using the gateway as the IP address, but if the gateway went down, it would be nice to still have internal-to-the-switch traffic continue. arp_ip_target won't do that, either; it will just shut down all interfaces, even the last one.
arp_interval depends on some amount of network traffic (?), where a very quiet link might get shut down by mistake.

Is there any way to get around those arp_interval limitations? Can miimon be configured any better? Is there a better way to accomplish HA Networking? We have been thinking of handling failover manually via a daemon on each server, instead of using arp_interval (i.e. monitor links ourselves and use ifenslave to take them up and down). We already aren't trunking for performance; reliability is really our priority here.

pQd · Accepted Answer · 2013-05-08T20:11:39.130

3

are you sure you tested it thoroughly?

according to this:

arp_ip_target specifies the IP addresses to use as ARP monitoring peers when arp_interval is > 0. Multiple IP addresses must be separated by a comma.

i have mode=1 set up on a couple of servers [although with a single ip provided] and it runs just fine, even without any traffic flowing. fail-over was tested multiple times with and without traffic.
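
As a rough sketch of what this could look like in the /etc/modprobe.conf from the question, with ARP monitoring replacing miimon: the two target addresses and the 1000 ms interval below are placeholders chosen for illustration, not values from the original setup. Listing more than one target (for example the gateway plus another always-on host) means a single unreachable peer should not by itself cause the link to be marked down.

```
alias bond0 bonding

# arp_interval is in milliseconds; arp_ip_target takes a comma-separated list.
# 192.0.2.1 and 192.0.2.2 are placeholder addresses, substitute real peers.
options bond0 mode=1 primary=eth0 arp_interval=1000 arp_ip_target=192.0.2.1,192.0.2.2
```

miimon is left out here because the link monitor and the ARP monitor are alternatives; the failure in the question is exactly the case the link monitor cannot see.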

edited May 08 '13 at 20:11

answered May 06 '13 at 21:31


I too have servers set up with multiple addresses and have never had a problem. – Keith Stokes May 06 '13 at 23:10
I read through that document and never saw that; my bad. Thanks for pointing it out. The documentation seemed to imply, also, that arp requests are actively sent by the kernel at a specified interval, but I got the impression on other forums that it only passively counts the number of packets sent across. Which is it? (Source: http://www.linuxforums.org/forum/red-hat-fedora-linux/121449-bonding-flapping-between-interfaces-using-arp_interval.html) – Aaron R. May 07 '13 at 04:30
they are indeed sent actively. tcpdump confirms it. – pQd May 07 '13 at 15:06
I tested this configuration (quiet server and multiple IPs routed), and it works as you suggested. Thanks for the help pQd! – Aaron R. May 08 '13 at 19:29
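
For anyone repeating that kind of test, here is a small sketch for watching which slave the bond is currently using. It assumes the bond is named bond0 and that the kernel exposes its state in /proc/net/bonding/bond0 with a "Currently Active Slave:" line; the helper name active_slave is made up for this example.

```shell
# Print the currently active slave from bonding status text read on stdin.
# active_slave is a hypothetical helper name, not part of any tool.
active_slave() {
    awk -F': ' '/^Currently Active Slave/ {print $2}'
}

# To see the ARP probes themselves on one slave (needs root), something like:
#   tcpdump -n -i eth0 arp
```

Running `active_slave < /proc/net/bonding/bond0` before and after breaking the path to the ARP targets should show the bond moving from one interface to the other.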
