I have several hosts that are showing connectivity problems. When working at the command line, for example, typing freezes for a second or so, then recovers - then it does it again.
The most egregious example host would freeze (input) for 15-30 seconds, then recover, only to drop out again 5 seconds later. Switching cables didn't do anything, but removing one of the physical cables made everything clear up instantly (which is why I think this is a network problem).
Watching the network traffic, I couldn't see anything on the wire that would explain this.
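(If anyone wants me to re-check, a raw capture with tcpdump on one of the slave ports is the kind of thing I mean by "watching the traffic" - eth0 below is just a placeholder for whichever slave interface is involved:)

    # watch raw traffic on one slave port, no name resolution
    tcpdump -ni eth0

    # or narrow it down to broadcast/multicast chatter only
    tcpdump -ni eth0 'broadcast or multicast'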
These Ethernet interfaces (gigabit, Dell) were working normally before, but since we moved the systems - and put them on a new set of switches - this has been a problem on multiple theoretically identically configured hosts.
The original switches were an HP ProCurve 1810-24G and an HP ProCurve 1800-24G, connected to each other with LLDP enabled; the new switches are both Cisco SG 200-26, which I understand are rebranded Linksys switches.
Is this caused by a problem with the switches? Is it the switch configurations? Are the Cisco switches incapable of handling this?
I don't see where the bonding configuration is located; I searched the usual /etc/sysconfig/network/devices, but there's nothing in there about options (like MII polling) and nothing about the balancing method for the two ports. Searching the scripts, I can't find anything in /etc/init.d/network either.
The hosts are almost all Red Hat Enterprise Linux 5.x systems (5.6, 5.7), but some are Ubuntu Server 10.04.3 Lucid Lynx. I need help with both if it comes to that.
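For reference, this is roughly what I expected the bonding setup to look like on each platform - these are the standard locations, and everything below (device names, addresses) is a placeholder rather than copied from the actual hosts:

    ## RHEL 5.x - driver options in /etc/modprobe.conf, interfaces under
    ## /etc/sysconfig/network-scripts/

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=balance-rr miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=static
    IPADDR=192.0.2.10
    NETMASK=255.255.255.0
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes

    ## Ubuntu 10.04 - ifenslave installed, everything in /etc/network/interfaces

    # /etc/network/interfaces
    auto bond0
    iface bond0 inet static
        address 192.0.2.11
        netmask 255.255.255.0
        slaves eth0 eth1
        bond_mode balance-rr
        bond_miimon 100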
UPDATE: We're also seeing some problems with servers on the original switches.
The HP switches and the Cisco switches are also interconnected (temporarily); there is a cable run from one switch to the next. Pinging any of these hosts shows about one ICMP packet out of every 5-6 getting dropped (timing out). Could there be an interaction between the two switches?
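(That loss figure is just from plain ping runs against the affected hosts, along these lines - HOST is a placeholder:)

    # 100 echo requests; the summary line reports % packet loss,
    # which here works out to roughly 15-20%
    ping -c 100 HOST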
Oh, and the hosts are using bonding, with balance-rr as the mode.
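(The mode and the per-slave link status are visible in /proc on each host; bond0 is the bond name here, adjust as needed:)

    # shows Bonding Mode, MII polling interval, and each slave's
    # link status and link failure count
    cat /proc/net/bonding/bond0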
UPDATE: One of the Cisco switches is running Cisco Discovery Protocol (CDP), and our VMware ESXi 4 host is picking this up even though that ESXi server is connected to an HP switch.
UPDATE: I updated the Cisco switches and turned off CDP, LLDP-MED, and STP - the HP switches don't support CDP or STP, and they don't appear to support LLDP-MED, so all of that is now shut off everywhere. That cleared up the problems for the hosts still connected to the HPs on our old network, but the hosts on the Cisco side are still dropping packets at an unacceptable rate - though only some of them.
The hosts showing no problems are the ones without an active bond: one or two have no bonded interfaces at all, and one has a bonded interface but with one of its ports disconnected.
What would happen if I took down a slave interface by hand? How does the bonded interface handle that?
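For the record, these are the ways I know of to pull a slave out by hand (eth1 standing in for whichever slave port, bond0 for the bond):

    # just down the slave link; the bond should carry on over the remaining port
    ifconfig eth1 down

    # or detach it from the bond through the bonding driver's sysfs interface
    echo -eth1 > /sys/class/net/bond0/bonding/slaves

    # or with the ifenslave tool, where installed
    ifenslave -d bond0 eth1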
UPDATE: Through testing, it turns out that bringing down a working slave interface doesn't take the whole bond down (which is the way it should be). Doing this on selected hosts (but not all of them) cleared up the problem; for some reason, connections to certain hosts were dropping large numbers of packets while other systems were unaffected. Three hosts had their bonded interfaces reduced to a single Ethernet port, and now dropped packets no longer seem to be a problem anywhere.
Of course, this doesn't solve the problem - it just makes it go away (which is the number one priority right now). The next step is to check the physical cabling the next time I'm at the data center - each host's two ports should go to separate switches, but do they? I'll be checking.
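(One thing that should make the cabling check easier: ethtool can blink a port's identification LED so each NIC can be traced to its switch port - eth0 is a placeholder, the number is seconds:)

    # blink the port LED on eth0 for 30 seconds while tracing the cable
    ethtool -p eth0 30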