Configuration:
- Two switches, each with a separate internet route
- CentOS servers with eth0 and eth1 bonded in active-backup mode as bond0, with eth0 plugged into one switch and eth1 into the other
- /etc/modprobe.conf configured as follows for bond0:
alias bond0 bonding
options bond0 mode=1 primary=eth0 miimon=100
- eth0 was sometimes plugged in to the primary switch, sometimes the secondary.
Scenario:
- Secondary switch has memory failure
- Link lights stay up, but switch is no longer handling traffic
Because we used miimon, which only checks link status, none of our servers removed that link from their bond when the switch failed. This caused network outages, and the servers whose eth0 was plugged into the secondary switch became entirely unavailable. Ironically, this was worse than if someone had gone through and yanked all the cables out, because in that case the bonds would have failed over.
I have been testing arp_interval as an alternative, but as I understand it, arp_interval has two limitations:
- arp_ip_target only takes one IP, meaning that if that IP address goes down, bond0 will mistakenly conclude the link is down and disable it. I was using the gateway as the target, but if the gateway went down, it would be nice for traffic internal to the switch to continue. arp_ip_target won't allow that either; it will just shut down all interfaces, even the last one.
- arp_interval depends on some amount of network traffic (?), so a very quiet link might get shut down by mistake.
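For reference, the ARP-monitoring variant I have been testing looks roughly like the fragment below. The target address 192.168.1.1 is only a placeholder for our gateway, and the 1000 ms interval is an example value rather than a recommendation.

```
alias bond0 bonding
# miimon is left unset here; ARP monitoring replaces link-state monitoring
options bond0 mode=1 primary=eth0 arp_interval=1000 arp_ip_target=192.168.1.1
```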
Is there any way to get around those arp_interval limitations? Can miimon be configured any better? Is there a better way to accomplish HA networking? We have been thinking of handling failover manually via a daemon on each server instead of using arp_interval (i.e., monitoring the links ourselves and using ifenslave to bring them up and down). We aren't trunking for performance; reliability is really our priority here.
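To make the daemon idea concrete, here is a minimal sketch of what we have in mind. It is not something we run today. It assumes Python 3 is available on the servers, that the bonding driver exposes /sys/class/net/bond0/bonding/active_slave, and that a single ping to a placeholder address (192.168.1.1 here) is an acceptable health check. The function names, the failure threshold, and the probe interval are all made up for illustration.

```python
import subprocess
import time

BOND = "bond0"                 # bond name from our modprobe.conf
SLAVES = ["eth0", "eth1"]      # the two bonded interfaces
PROBE_TARGET = "192.168.1.1"   # placeholder: an address that must stay reachable
MAX_FAILURES = 3               # illustrative: failed probes before switching
PROBE_INTERVAL = 1.0           # illustrative: seconds between probes


def active_slave(bond=BOND):
    """Read the currently active slave from the bonding sysfs interface."""
    with open("/sys/class/net/%s/bonding/active_slave" % bond) as f:
        return f.read().strip()


def probe(target=PROBE_TARGET):
    """Return True if a single ping to the target succeeds.

    The ping leaves via the bond, so it exercises whichever slave is active.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def next_slave(current, slaves=SLAVES):
    """Pick the slave to switch to: the one after `current` in the list."""
    if current not in slaves:
        return slaves[0]
    return slaves[(slaves.index(current) + 1) % len(slaves)]


def update_failures(failures, probe_ok, max_failures=MAX_FAILURES):
    """Return (new_failure_count, should_switch) after one probe result."""
    if probe_ok:
        return 0, False
    failures += 1
    if failures >= max_failures:
        # Reset the counter so the newly active slave gets a fresh run.
        return 0, True
    return failures, False


def switch_to(slave, bond=BOND):
    """Make `slave` the active slave; ifenslave -c changes the active slave."""
    subprocess.run(["ifenslave", "-c", bond, slave], check=True)


def main():
    """Probe forever and switch the active slave after repeated failures."""
    failures = 0
    while True:
        failures, should_switch = update_failures(failures, probe())
        if should_switch:
            switch_to(next_slave(active_slave()))
        time.sleep(PROBE_INTERVAL)
```

Calling main() as root would start the loop. As written, this inherits the single-target problem: if the probe target itself goes down, the daemon would flip between eth0 and eth1 every few seconds. We also have not checked how a manual switch interacts with primary=eth0.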