6

We are experiencing a very strange and frustrating problem. Our company has servers here in Massachusetts as well as in California, and the issues we are seeing are only on the CA hardware. Out in CA we have several hundred Dell R300 and Dell R310 servers, all connected to four HP Procurve 4208vl switches. There are two switches for each model: one for the front-end network and one for the back-end network. These systems are arranged in clusters and are all used for various tests of the OS we are developing. Many of these tests require successive and/or repeated reboots, and many, if not most, re-provision the nodes with the OS again. The problem is that, given enough time and seemingly at random, one (or many) of these systems will end up with a downed eth0 or eth1 interface.

The issue is that a node will intermittently boot up with no connectivity on eth0 or eth1, and sometimes both. The workaround is to SSH in via the backend (if eth0 is down) or the frontend (if eth1 is down) and run ifdown/ifup on the downed interface.

List of workarounds:

  • service network restart
  • ifdown eth1 (or eth0), then ifup eth1 (or eth0)
  • reseat the network cables
  • reboot the server
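
For reference, the manual fix is a one-liner run from whichever network still answers (a rough sketch; the node name here is illustrative):

ssh root@node17-backend 'ifdown eth0 && ifup eth0'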

This is a huge pain for the development team, as it stops entire clusters from running their tests until someone intervenes manually.

The worst part occurs when a node boots into busybox for an OS install and eth0 drops out: in this case the node is completely unreachable, since we don't have eth1 in busybox, and the OS install can't proceed because the node can't talk to the PXE server to pull down the latest OS image (eth0 being down). Nodes that fall into this state stay stuck until the next time I get someone in CA on the phone and have them manually reboot the node.

The following has been done to attempt to resolve this seemingly random and irreproducible issue:

  • Both the Procurve switch and R310 firmware have been updated to the latest available revisions.
  • Both switches and servers are set to autonegotiate (1000/full duplex).
  • We're seeing this across 4 different HP switches and about 200-400 Dell servers (they were all purchased at different times, so it's not just a bad lot).
  • We do not have this issue on other hardware in CA, including Dell 860s and 750s plugged into their own HP Procurve switch.
  • This issue does not appear to happen when the nodes are plugged into a different switch (although we lack the hardware to test fully on a different switch).

Before the firmware upgrade, the HP Procurve switch logs show:

  • excessive broadcasts detected on port x
  • high collision or drop rate on port x
  • excessive CRC/alignment errors on port x

After the firmware upgrade we see fewer of these errors, but they still persist.

For troubleshooting, I have been logging the usual info:

{ ifconfig; for n in 0 1; do ethtool eth$n; ethtool -i eth$n; ethtool -k eth$n; ethtool -S eth$n; done;
  dmesg | egrep 'eth|bnx|e1000'; cat /var/log/messages; } > /tmp/eth_issues

Here are some examples of output:

# ethtool -i eth0
driver: bnx2
version: 2.1.6
firmware-version: 6.4.5 bc 5.2.3 NCSI 2.0.11
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on

 # ethtool -S eth0
 NIC statistics:
 rx_bytes: 0
 rx_error_bytes: 0
 tx_bytes: 5676016
 tx_error_bytes: 0
 rx_ucast_packets: 0
 rx_mcast_packets: 0
 rx_bcast_packets: 0
 tx_ucast_packets: 0
 tx_mcast_packets: 7
 tx_bcast_packets: 10495
 tx_mac_errors: 0
 tx_carrier_errors: 0
 rx_crc_errors: 0
 rx_align_errors: 0
 tx_single_collisions: 0
 tx_multi_collisions: 0
 tx_deferred: 0
 tx_excess_collisions: 0
 tx_late_collisions: 0
 tx_total_collisions: 0
 rx_fragments: 0
 rx_jabbers: 0
 rx_undersize_packets: 0
 rx_oversize_packets: 0
 rx_64_byte_packets: 0
 rx_65_to_127_byte_packets: 0
 rx_128_to_255_byte_packets: 0
 rx_256_to_511_byte_packets: 0
 rx_512_to_1023_byte_packets: 0
 rx_1024_to_1522_byte_packets: 0
 rx_1523_to_9022_byte_packets: 0
 tx_64_byte_packets: 1054
 tx_65_to_127_byte_packets: 7
 tx_128_to_255_byte_packets: 0
 tx_256_to_511_byte_packets: 0
 tx_512_to_1023_byte_packets: 9441
 tx_1024_to_1522_byte_packets: 0
 tx_1523_to_9022_byte_packets: 0
 rx_xon_frames: 0
 rx_xoff_frames: 0
 tx_xon_frames: 0
 tx_xoff_frames: 0
 rx_mac_ctrl_frames: 0
 rx_filtered_packets: 0
 rx_ftq_discards: 0
 rx_discards: 0
 rx_fw_discards: 0

We've spent countless hours on the phone with Dell and HP and we can't seem to figure out what is causing this issue. At first we thought the firmware upgrades would fix it, but after that went nowhere, each company claimed it could not support the other's hardware and refused to help any further.

Can someone help me track this issue down to its root cause? Keep in mind that I never know when or which system will be the culprit, and the OS gets re-provisioned a lot, so installing software to help log this is useless since it will be lost during the node's next provisioning. Any help or insight you could provide would be appreciated. Any hunches or thoughts are welcome, too. Please let me know if you need more details or output posted. Thanks.

  • What OS (linux distribution) do you use? – Nils Jan 05 '12 at 21:27
  • It's a custom OS based on Fedora 14. There's not a whole lot changed under the hood, though; nothing that would make me suspect it of causing this issue. It's just modified for our product, which is a storage platform product. – Brendon Martino Jan 06 '12 at 18:11

3 Answers

3

The answer is: get a better NIC and note to self to never buy Broadcom again:

http://blog.serverfault.com/2011/03/04/broadcom-die-mutha/

Hubert Kario
  • All well and good advice, but I simply cannot control what my company does. I wanted to test out these nodes on another type of switch, but they don't have the funding to even do that, much less replace all 300 nodes. Thanks for your concern, though. – Brendon Martino Jan 06 '12 at 12:19
  • (Also fair to say that they would not replace all NICs, as this would not only be a TON of work across several states, but it would also require a ton of changes to our database to update the new NIC info, and that just isn't going to happen.) – Brendon Martino Jan 06 '12 at 12:30
  • Then I can only pity you. If this is business-critical: why on earth don't you have the budget to buy at least a few Intel NICs to test whether this is indeed the root of the problem? Your man-hours have already cost more, let alone the depreciation of the hardware. If it's not business-critical: you'll have to live with crappy NICs. CRC/alignment errors are *exactly* the things described in the blog.SF post, both in the main entry and in the comments. – Hubert Kario Jan 06 '12 at 12:50
  • Oh, and showing support "hey, we changed NICs and the problems went away" is surely worth something. For example, pointing the problem at the Dell hardware, not the HP switches. – Hubert Kario Jan 06 '12 at 12:57
  • I agree, and frankly, if I could switch out all the NICs myself, I would. However, the main sticking point with management is that this has not always been an issue. It has been ongoing for over a year now, but I am relatively new to the company (3 months), so this is all news to me. The problem is that these servers reside in California. We'd have to change NICs on hundreds of servers that are currently being used all the time. – Brendon Martino Jan 06 '12 at 13:59
  • Even if we could do that, which would be a HUGE undertaking to schedule the replacement of them all, we would also have to replace the MACs in our database. The man-hours alone to track all those changes would be ghastly. If, on the other hand, you wanted to check only a few NICs that you changed to Intel (or whatever), how exactly would you prove that they are stable? – Brendon Martino Jan 06 '12 at 13:59
  • This problem occurs so randomly and without rhyme or reason that you could never prove anything without replacing at least half of all the servers' NICs. Because even if you did replace one or two NICs, it wouldn't prove anything in this case. Do you understand what I mean by that? – Brendon Martino Jan 06 '12 at 13:59
  • You take 10 servers with Broadcom NICs and 10 servers with Intel NICs (the more the better), and keep on re-provisioning them (the Broadcom issue is most apparent when the cards are working full time). See if the machines with Intel cards crash too, and how often the Broadcom-based ones crash. If you run the test for a week during which you had 8 or 10 outages with Broadcoms and none with Intel, I'd say it's a pretty clear-cut situation. Just being sure where the problem lies is worth much; what the management does with this information isn't important. You've done your job. – Hubert Kario Jan 06 '12 at 14:09
  • I agree... you may not be able to control your company... but the end result is... you're the IT department. If they buy stuff that doesn't work... you point at them and tell them they shouldn't buy crap... and then they get to buy it again... this time with the right bits. – TheCompWiz Jan 06 '12 at 14:58
  • I know, it's a fight. They are insistent that these bnx2 drivers never had issues in the past, and we have tons of hardware with them, so they find it hard to believe that it's the culprit. I am having a hard time making the case that they are the issue, since I have no proof. Even when this problem happens, the output of ethtool shows that the link is ready even though the device is downed. – Brendon Martino Jan 06 '12 at 18:04
  • Even if I got them to replace 10 NICs over one week, it wouldn't prove much. We have over 400 of these servers combined (R310s and R300s) and even more of other ones (R610s, etc.). When you only come across four or five of these issues a week, it hardly seems fair to say that the ten I selected were immune. They could just be lucky. I still don't see how it proves anything with such a vast range of servers... – Brendon Martino Jan 06 '12 at 18:07
  • Problem is, those Dells mainly have Broadcom on board, and the PCIe cards are normally not able to PXE-boot. – Nils Jan 06 '12 at 20:32
  • They had no issues because they weren't used to their full capacity. Of course you won't have problems with boxes sitting there doing hardly anything (like 95% of servers in SMBs out there). I also didn't say anything about putting the Intel cards into production. I said: do a test, make the cards work as hard as they can, and if they start falling apart, they're at fault. Either they are the problem, or they are not. There are tons of threads all over the net about crappy Broadcom NICs (there are at least 3 on SF alone); you won't find ones about Intel cards or HP switches. I'm 99% sure the cards are the problem. – Hubert Kario Jan 06 '12 at 21:15
  • You could be right, but the problem is that I need proof to justify replacing them to my managers. No offense, but you guys in this forum simply won't do for their justification. Even if I did a test, it still wouldn't prove anything, because it isn't going to be the exact same tests as the rest of our stuff, and I can't reproduce that environment... Isn't there some way to run a diagnostic on a server when this issue happens to pinpoint the exact fault? – Brendon Martino Jan 09 '12 at 15:45
  • The problem lies in the hardware. You (or your bosses) don't believe the HP switches when they tell you that the NICs send garbage. You can't debug hardware, short of connecting a serial server to the JTAG on each and every server (or on one, and hoping that that one will fail). Doubly impossible because you don't have the schematics or the source code for the hardware's firmware. The only remotely doable thing is to do a test. If you don't want to or can't do that, then sorry, but you won't fix this problem. – Hubert Kario Jan 09 '12 at 17:01
  • Yeah, I see your point. Believe me, it's not me. I believe what you guys have been saying. The task left to me is to somehow prove it to my managers BEFORE replacing hardware. It is a daunting task... I'm at the end of my rope. They refuse to believe it because they claim they have a history with these Broadcom NICs and don't think they were bad in the past. It's frustrating, to say the least... I'm out of ideas to try... that's why I came here. :( – Brendon Martino Jan 09 '12 at 19:22
2

Honestly, I doubt it's an issue with hardware at this point... and more an issue with the underlying driver in the OS you're trying to boot. In my own experience the bnx2 driver is notorious for being pretty terrible... as it's written by Broadcom to try and make open-source users happy, but not much more than that. Have you tried downloading/building drivers directly from Broadcom? It would be more interesting to see what's in the insane amount of broadcast packets... (read that as: try capturing packets between the NIC and the switch) and throw that at Broadcom for feedback. The old switch(es) may not have complained because they didn't bother dealing with the flood of bad packets... (hence the high number of errors reported on the new switches)
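
Something like this would do as a rough capture; a sketch only, assuming tcpdump is present in your provisioned image, and the interface name, file sizes and output path are placeholders to adjust:

# keep a small ring of capture files so the capture can run unattended without filling the disk
tcpdump -n -i eth0 -C 10 -W 5 -w /tmp/eth0-bcast.pcap 'broadcast or multicast'

Pull the rotated files off the node and open them in Wireshark to see what that broadcast flood actually contains.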

TheCompWiz
  • No, we haven't, but I can give that a try. Not really sure if building a driver is going to do too much, though, as it is already using the one that is best suited for the OS. How might I go about capturing all those packets? (Remember: I never know which node will be the culprit, and I can't install anything on all nodes because they all get re-provisioned, eventually, through their normal cycle of testing.) – Brendon Martino Jan 05 '12 at 19:28
  • Seems like it would take a lot to build the driver and install it into all of our nodes. That's the only way we would be able to tell if it works or not. – Brendon Martino Jan 05 '12 at 19:49
  • As far as checking out the packets, couldn't we just use wireshark or something for that? – Brendon Martino Jan 05 '12 at 19:50
  • Of course, the problem is, we never know WHICH node to monitor/capture packets. !?! – Brendon Martino Jan 05 '12 at 19:56
  • Is there anything in the packet that would sound a red alarm to you as far as this specific network issue (in other words, what would I be looking for, and what would be an obvious red flag to check out)? – Brendon Martino Jan 05 '12 at 19:58
  • A good red-alarm... is any data that looks like it shouldn't be there. i.e. random garbage... packets of random types... etc... – TheCompWiz Jan 05 '12 at 20:18
  • I'm inquiring to see what can be done to enable this type of packet capturing on each server (the OS would need to be modified, as servers may get re-provisioned frequently), however that is not my decision to make. You can see my problem, though: almost any attempt to do this would require me to either A) know exactly which node to look at and when, or B) make a global change to a group of nodes, which would obviously require the OS itself to be modified since each node gets re-provisioned so often. – Brendon Martino Jan 06 '12 at 12:24
  • Honestly... (in my experience...) this is just good practice. You should be able to monitor traffic of any node in your network and look for suspicious traffic. Your HP switches should have a "port mirroring" feature that will allow you to capture data from any/all nodes in question... but keep in mind... 10x1gb connections may saturate the bandwidth of your monitoring port. – TheCompWiz Jan 06 '12 at 14:57
  • I've looked at the capabilities of the switch, but I don't think that's a possibility for the reasons you mentioned. We have close to 200 servers on one switch (200 frontend nics and another 200 backend). The bandwidth would be toast. – Brendon Martino Jan 06 '12 at 18:10
2

We have a number of R300s and R310s, and we have never had an issue after booting them. BTW, what does Dell support say about your case?

So my guess is that there is something wrong on the network side of the hardware (the Procurve switches). However, if I were you, I would write a simple workaround:

An init-script that runs at a late stage and does the ifdown/ifup if no link is detected on eth0 or eth1.
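
A minimal sketch of such a script, assuming ethtool is present in the image and that the failed interface actually reports "Link detected: no" (if it still shows a link while dead, swap the check for a connectivity test such as pinging the default gateway):

#!/bin/sh
# late-boot sanity check: bounce any interface that comes up without a link
for IF in eth0 eth1; do
    if ethtool "$IF" 2>/dev/null | grep -q 'Link detected: no'; then
        ifdown "$IF"
        ifup "$IF"
    fi
done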

BTW: eth0 and eth1 are both on board? Then both should be able to PXE-boot (I am not at work right now, so I am not sure about the number of onboard interfaces; I usually use the bigger brothers, the R510, R710, ...).

Nils
  • Yes, they have two NICs. The frontend NIC is used for booting; the other NIC is on the backend and is used for communication between clusters of nodes, so we wouldn't be using that for PXE booting, since those NICs are not registered for it. – Brendon Martino Jan 06 '12 at 12:26
  • Interestingly, as this problem apparently predates my time at this company, they have also seen it happen before the OS gets installed, e.g. while a node is attempting to PXE boot. Although I have never seen this personally, if it is true, the OS workaround would not work in that case, since there is no OS at that point... – Brendon Martino Jan 06 '12 at 12:28
  • Also, we just saw this happen on a "bigger brother", the R610, for the first time yesterday... same driver, though (Broadcom bnx2). – Brendon Martino Jan 06 '12 at 12:31
  • @Nils Are you running Windows, or a *nix-flavored OS? Sure, the Windows drivers for Broadcom equipment work most of the time... it's the *nix-flavored drivers that are terrible. – TheCompWiz Jan 06 '12 at 15:00
  • @TheCompWiz we run "both": W2K8R2 and Linux. Regarding the drivers, I did not notice problems there. But those Broadcoms are not as capable as the Intels. Even Dell said we should use a PCIe Intel on a network-I/O-heavy system instead. – Nils Jan 06 '12 at 20:05
  • If this is even happening BEFORE boot, enabling the BIOS boot-retry loop might help. – Nils Jan 06 '12 at 20:06
  • I'll check to see if that's an option... however, we are trying to get to the root cause of the problem instead of slapping a band-aid on it... – Brendon Martino Jan 06 '12 at 20:11
  • Sure. Have the checks of your switches revealed anything interesting? – Nils Jan 06 '12 at 20:30
  • Nope. Just that the NIC is unreachable when this happens. – Brendon Martino Jan 09 '12 at 15:43