I have several hosts that are showing connectivity problems. When working at the command line, for example, typing freezes for a second or so, then recovers - then it does it again.
The most egregious example host would freeze (input) for 15-30 seconds, then recover, only to drop out again 5 seconds later. Switching cables didn't do anything, but removing one of the physical cables made everything clear up instantly (which is why I think this is a network problem).
Watching the network traffic, I couldn't see anything on the wire that would explain this.
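(If anyone wants me to re-check, a raw capture with tcpdump on one of the slave ports is the kind of thing I mean by "watching the traffic" - eth0 below is just a placeholder for whichever slave interface is involved:)

    # watch raw traffic on one slave port, no name resolution
    tcpdump -ni eth0

    # or narrow it down to broadcast/multicast chatter only
    tcpdump -ni eth0 'broadcast or multicast'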
These Ethernet interfaces (gigabit, Dell) were working normally before, but since we moved the systems - and put them on a new set of switches - this has been a problem on multiple theoretically identically configured hosts.
The original switches were an HP ProCurve 1810-24G and an HP ProCurve 1800-24G, connected to each other with LLDP enabled; the new switches are both Cisco SG 200-26, which I understand are rebranded Linksys switches.
Is this caused by a problem with the switches? Is it the switch configurations? Are the Cisco switches incapable of handling this?
I don't see where the bonding configuration is located; I searched the usual /etc/sysconfig/network/devices, but there's nothing in there about options (like MII polling) and nothing about the balancing method for the two ports. Searching the scripts, I can't find anything in /etc/init.d/network either.
The hosts are almost all Red Hat Enterprise Linux 5.x systems (5.6, 5.7), but some are Ubuntu Server 10.04.3 Lucid Lynx. I need help with both if it comes to that.
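For reference, this is roughly what I expected the bonding setup to look like on each platform - these are the standard locations, and everything below (device names, addresses) is a placeholder rather than copied from the actual hosts:

    ## RHEL 5.x - driver options in /etc/modprobe.conf, interfaces under
    ## /etc/sysconfig/network-scripts/

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=balance-rr miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=static
    IPADDR=192.0.2.10
    NETMASK=255.255.255.0
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes

    ## Ubuntu 10.04 - ifenslave installed, everything in /etc/network/interfaces

    # /etc/network/interfaces
    auto bond0
    iface bond0 inet static
        address 192.0.2.11
        netmask 255.255.255.0
        slaves eth0 eth1
        bond_mode balance-rr
        bond_miimon 100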
UPDATE: We're also seeing some problems with servers on the original switches.
The HP switches and the Cisco switches are also interconnected (temporarily); there is a cable run from one switch to the next. Pinging any of these hosts shows about one ICMP packet out of every 5-6 getting dropped (timing out). Could there be an interaction between the two switches?
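(That loss figure is just from plain ping runs against the affected hosts, along these lines - HOST is a placeholder:)

    # 100 echo requests; the summary line reports % packet loss,
    # which here works out to roughly 15-20%
    ping -c 100 HOST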
Oh, and the hosts are using bonding, with balance-rr as the mode.
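(The mode and the per-slave link status are visible in /proc on each host; bond0 is the bond name here, adjust as needed:)

    # shows Bonding Mode, MII polling interval, and each slave's
    # link status and link failure count
    cat /proc/net/bonding/bond0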
UPDATE: One of the Cisco switches is running Cisco Discovery Protocol (CDP), and our VMware ESXi 4 host is picking this up even though that ESXi server is connected to an HP switch.
UPDATE: I updated the Cisco switches and turned off CDP, LLDP-MED, and STP - the HP switches don't support CDP or STP, and they don't appear to support LLDP-MED, so all of that is now shut off everywhere. That cleared up the problems for the hosts still connected to the HPs on our old network, but the hosts on the Cisco side are still dropping packets at an unacceptable rate - though only some of them.
The hosts showing no problems are the ones without an active bond: one or two have no bonded interfaces at all, and one has a bonded interface but with one of its ports disconnected.
What would happen if I took down a slave interface by hand? How does the bonded interface handle that?
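For the record, these are the ways I know of to pull a slave out by hand (eth1 standing in for whichever slave port, bond0 for the bond):

    # just down the slave link; the bond should carry on over the remaining port
    ifconfig eth1 down

    # or detach it from the bond through the bonding driver's sysfs interface
    echo -eth1 > /sys/class/net/bond0/bonding/slaves

    # or with the ifenslave tool, where installed
    ifenslave -d bond0 eth1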
UPDATE: Through testing, it turns out that bringing down a working slave interface doesn't take the whole bond down (which is the way it should be). Doing this on selected hosts (but not all of them) cleared up the problem; for some reason, connections to certain hosts were dropping large numbers of packets while other systems were unaffected. Three hosts had their bonded interfaces reduced to a single Ethernet port, and now dropped packets no longer seem to be a problem anywhere.
Of course, this doesn't solve the problem - it just makes it go away (which is the number one priority right now). The next step is to check the physical cabling the next time I'm at the data center - each host's two ports should go to separate switches, but do they? I'll be checking.
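(One thing that should make the cabling check easier: ethtool can blink a port's identification LED so each NIC can be traced to its switch port - eth0 is a placeholder, the number is seconds:)

    # blink the port LED on eth0 for 30 seconds while tracing the cable
    ethtool -p eth0 30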