3

Update: I have now performed the upgrade. I used the half-ninja, half-hack solution of plugging in USB-to-Ethernet adapters to add to the teams and hold the fort. I plugged in one per team, removed the other affected adapters from each team, shut down Windows, swapped out the card, made sure the USB adapters stayed in the same USB ports so they would come back the same way, and booted up. The USB adapters were still there, and I was able to restore the team configuration by manually adding the new NICs to the teams.
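For anyone repeating this, here is a minimal PowerShell sketch of that sequence, assuming the built-in LBFO cmdlets; the team and adapter names below are hypothetical and should be taken from Get-NetLbfoTeam and Get-NetAdapter on the actual system:

    # Before shutting down: add the USB adapter so the team never loses its last
    # member, then remove the team members that live on the card being replaced.
    Add-NetLbfoTeamMember    -Name "USB Ethernet"   -Team "Team-10G-A"
    Remove-NetLbfoTeamMember -Name "10G Port 1"     -Team "Team-10G-A"
    Remove-NetLbfoTeamMember -Name "10G Port 2"     -Team "Team-10G-A"

    # Shut down, swap the card (keeping the USB adapter in the same USB port), boot.

    # After boot: add the ports on the new card, then retire the placeholder.
    Add-NetLbfoTeamMember    -Name "New 10G Port 1" -Team "Team-10G-A"
    Add-NetLbfoTeamMember    -Name "New 10G Port 2" -Team "Team-10G-A"
    Remove-NetLbfoTeamMember -Name "USB Ethernet"   -Team "Team-10G-A"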

This solution was first proposed by @Drifter104 in a comment. @shouldbeq931's answer was the first to propose adding another card to sidestep the problem and received the bounty. Both answers were helpful, so in fairness I am marking @llorrac's exhaustive, top-voted answer as accepted, since it pointed out the importance of removing the NICs on the broken card from the teams before swapping it out.

I still don't know exactly what happens when you don't do this, or what Microsoft's guidance for swapping out cards is - but that's Microsoft's fault, and I appreciate the help I got here.


Original question: I am administering a Windows Server 2012 R2 cluster running Hyper-V workloads. All cluster nodes have several networks, served by several physical network cards, where Windows Server's NIC teaming is used to team two ports together (the teams never span physical network cards). A port on a physical network card in one of the cluster nodes has recently failed; that port has been removed from its team, and a new physical network card of identical make and model has been ordered.

  • If I replace the card as is and plug everything in the same way, will everything be picked back up by the NIC teaming? By cluster networks? The physical network card will be in the same slot and of the same model, but the MAC addresses will clearly be different, and I don't know if the tags that Dell put on the various ports to correlate them (there's an acronym for this but it eludes me) will be available.

  • If not, will I need to tear everything down and reconfigure the teams/cluster networks?

  • Is there any good official guidance or other advice about how to go about this? I haven't found anything, but I don't know quite what to search for. (The closest is this forum thread, which was written back when network teaming wasn't provided by Windows Server and someone had to use a hardware solution from the vendor, so Microsoft's response to that situation was "you're on your own".)

Edit: hopefully this question will answer the general question "does stuff break and if so, how can I avoid that", but I realize that more details will be helpful so I am providing them.

The server has a total of six ports, divided across two cards. One card has two 10 Gbit ports, with a team spanning both ports. The other card has two 10 Gbit ports with a team spanning both, as well as two 1 Gbit ports with a team spanning both. The 1 Gbit team is hooked up to our general network switch. The two 10 Gbit teams are hooked up point-to-point, directly to our storage server and to the other cluster node, and the networking works out with hard-coded IP addresses and without a switch. (This works, but I would not recommend it, nor would I repeat it in a new configuration. So yes, I know that it is horrible and rules out a bunch of useful things around VLANs and network hygiene. As far as I can tell it doesn't have an impact on what I'm asking, which is how Windows Server NIC teaming reacts to changed hardware.) The malfunctioning port is in one of the 10 Gbit teams. All teams use the Switch Independent teaming mode (since there's no switch).
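For what it's worth, the layout above (and the failed member) can be confirmed from PowerShell before anything is touched; a short sketch, assuming the standard NetLbfo cmdlets:

    # List the teams, their teaming mode and overall health.
    Get-NetLbfoTeam | Format-Table Name, TeamingMode, Status

    # List every member port and the team it belongs to; the port on the
    # broken card should show up as failed here.
    Get-NetLbfoTeamMember | Format-Table Name, Team, OperationalStatus, FailureReason

    # Map the Windows adapter names back to the physical cards and ports.
    Get-NetAdapter -Physical | Format-Table Name, InterfaceDescription, MacAddress, LinkSpeed, Status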

Jesper
  • You can't create a team without adding at least 1 NIC, so it would stand to reason that if all the NICs were removed it would cease to exist. That said, it only has to have 1 NIC present for it to continue to exist. Simply buy a USB-to-Ethernet adaptor, add it to the team, remove the card, add the two new NICs to the team and then remove the USB one. – Drifter104 Aug 09 '17 at 09:40
  • @Drifter104: this is genius and I think this will work. I'll prepare for it not working but this would save a lot of hassle if it did. – Jesper Aug 09 '17 at 10:02

2 Answers

5

This is an important question, and I'd say a more common scenario than your searching suggests.

As you may know, there are three teaming modes provided by Windows Server:

  1. Switch Independent
  2. Static
  3. LACP

Based on your statement about whether you will have to "tear everything down", it sounds to me like you are using Static teaming, which requires more manual configuration than the other two.
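If in doubt, the mode a team actually uses can be read back directly in PowerShell; a small sketch, with illustrative team and member names:

    # Show the configured teaming mode and load-balancing algorithm.
    Get-NetLbfoTeam -Name "Team-10G-A" |
        Select-Object Name, TeamingMode, LoadBalancingAlgorithm

    # The mode is picked when the team is created, e.g. for a switch-independent team:
    New-NetLbfoTeam -Name "Team-10G-A" -TeamMembers "10G Port 1","10G Port 2" -TeamingMode SwitchIndependent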

Regarding replacing the NIC.

Whichever teaming mode you use, you have to make sure that your dead NIC is removed from the team settings before unplugging anything!

Will it be picked up by teaming when you plug the new NIC in? Yes, but depending on which configuration you're using, you may need to manually add it to your team.

  1. Remove NIC from team
  2. Remove physical NIC
  3. Replace physical NIC
  4. Add new NIC to team
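
In cmdlet form, with illustrative adapter and team names, this is roughly (the physical replacement in steps 2 and 3 happens in between):

    # 1. Remove the dead member from the team before touching the hardware.
    Remove-NetLbfoTeamMember -Name "Dead 10G Port" -Team "Team-10G-A"

    # 2./3. Shut down, replace the physical card, boot back up.

    # 4. Add the port on the replacement card to the team.
    Add-NetLbfoTeamMember -Name "New 10G Port" -Team "Team-10G-A"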

Check out this document from Microsoft TechNet for reference - section 4.6, Checking the status of a team. There are options for editing team settings either in the GUI or through PowerShell.

Regarding MAC addresses and cluster networks.

Again, per the documentation, peers talking to the team resolve a single IP address, which rests on one primary MAC address drawn from the member pool. As such, if you follow the steps in the linked documentation, you shouldn't run into any MAC address configuration errors.
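Before the maintenance window it is also cheap to record what the team interface currently presents; a sketch, again with an illustrative team name:

    # Note the MAC address the team interface currently presents to the network.
    Get-NetAdapter -Name "Team-10G-A" | Select-Object Name, MacAddress

    # Note the team interface's VLAN (if any) and its static IP configuration.
    Get-NetLbfoTeamNic -Team "Team-10G-A" | Select-Object Name, Team, VlanID
    Get-NetIPAddress -InterfaceAlias "Team-10G-A" | Select-Object IPAddress, PrefixLength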

In summary.

I once had to conduct a post-incident review in a similar situation. The engineer planned to shut down a switch to replace it, but didn't remove it from the pool. This meant that when he shut the switch down, all network traffic was lost, causing errors across 250k+ end-user devices. ¯\_(ツ)_/¯

Check out the docs - there is some other material specific to Hyper-V that might be more relevant to you.

llorrac
  • Thanks for the exhaustive info. We are using the Switch Independent teaming mode and not the Static Teaming mode. It sounds like this changes your answer? (The reason it's Switch Independent is because we are actually not using a switch for this, for various reasons.) – Jesper Aug 08 '17 at 13:51
  • I am also seeing that, naturally, I can't remove all NICs from a team. (Each team contains two ports from the same physical network card; we have three teams of two ports each, drawn from one physical card with four ports and one with two ports - we couldn't spread them across two cards each even if we'd wanted to.) So it seems that I will have to remove the team configuration completely. – Jesper Aug 08 '17 at 14:11
  • Okay, your original question is clearer now that you've mentioned you're using multiple teams. You actually _are_ spanning a network card, just not spanning _multiple_ cards. The teaming config doesn't impact this workflow, nor does the existence or non-existence of a switch. But multiple teams on a single card does. I'm still not 100% on your architecture, and it would be helpful to know which NIC has the dead port - but I guess it doesn't matter for the flow. My take on your last comment is that you're aggregating the 4-port NIC into 2 teams, and the 2-port NIC into 1 team. – llorrac Aug 09 '17 at 06:41
  • If this is the case, you need to add additional steps: 1. Check the VLAN(s) of the team(s) hanging off the NIC to be replaced. 2. Add the new NIC. 3. Follow step _4.7.3 - Adding New Team Interface_. 4. Ensure the new interface is on the same VLAN as the original team(s). 5. Make the new interface the default. 6. Disable the ports on the original NIC in the team(s). 7. Remove the NIC with the dead port. Let me know if this helps, and I will update my original response. – llorrac Aug 09 '17 at 06:42
  • Sadly, due to the server's configuration, I can't add a new card, only replace one of the existing cards. And yes, on the physical card with four ports, there are two teams with two ports each and on the physical card with two ports, there's one team. (The reason I'm saying card instead of NIC is that I was taught that a physical card was the NIC, but Teaming seems to use "NIC" to mean "individual port/interface with a MAC address", so I just want to avoid confusion.) – Jesper Aug 09 '17 at 09:05
  • I edited the question to add more information. – Jesper Aug 09 '17 at 09:20
  • Okay! Do you have some change management processes around your services? You may need to take down your Hyper-V services before swapping out the hardware to avoid network loss errors. As someone else said, if the hardware is removed completely, the teams won't exist. If you can note the VLAN and IP of each team, you should be able to bring everything back up in the same order once the new hardware is installed, without too much hassle. – llorrac Aug 09 '17 at 12:02
  • It looks like I'll either have to do that or use the brilliant proposal by Drifter104 to plug in USB-to-Ethernet adapters to "hold the fort" in each team and keep them alive even as I remove the ports from the old card. It's a bit hacky, but the USB adapters are of course not going to carry production traffic - they're just there to maintain the existence of the teams so that I can use a clean method. – Jesper Aug 09 '17 at 13:06
  • It is a brilliant idea and I hope it works. Please update if it does. – llorrac Aug 10 '17 at 22:49
2

Windows abstracts the underlying NICs in a team: when a NIC is removed from a team and a new NIC is added, the team remains the same. As long as there is at least one NIC in the team, the team configuration persists. If you remove all of the NICs from a team, there is no team left.

Depending on your configuration, your ability to take maintenance windows, and free PCIe slots, you might prefer to add an additional NIC to the team before removing the failed NIC.
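A quick sanity check along those lines, with illustrative names - if the member count ever drops to zero, the team and its configuration disappear with it:

    # How many members does the team still have?
    (Get-NetLbfoTeamMember -Team "Team-10G-A" | Measure-Object).Count

    # Add the stand-in or replacement NIC first, then remove the failed one,
    # so the count never reaches zero.
    Add-NetLbfoTeamMember    -Name "Spare NIC"  -Team "Team-10G-A"
    Remove-NetLbfoTeamMember -Name "Failed NIC" -Team "Team-10G-A"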

I always build teams across multiple NICs so that in the event of a NIC failure the team will remain up. I also tend to build teams across different NIC vendors, so that in the event of a "faulty" NIC driver being deployed the team will still stay up.

shouldbeq931
  • We won't have the ability to add another NIC. It looks like we'll have to tear down the configuration. Is it defined or described how clustering will react to this, or how to "stop everything" so that this can be done without anything panicking and potentially overcompensating? – Jesper Aug 09 '17 at 09:22
  • From https://gallery.technet.microsoft.com/windows-server-2012-r2-nic-85aa1318: "Any Ethernet NIC that has passed the Windows Hardware Qualification and Logo test (WHQL tests) may be used in a Windows Server 2012 R2 team." So a USB NIC might suffice for keeping the team configuration. – shouldbeq931 Aug 10 '17 at 18:54
  • Awarding the bounty to this answer since it was the first to propose adding a new NIC in the meantime, which it appears will let me sidestep the nastiness - very likely the first preference of anyone confronted with this situation. If @Drifter104's comment had been an answer and not a comment (I know the software sometimes downgrades short answers to comments), I would have given the bounty to them, since the USB NIC was the crucial part. – Jesper Aug 14 '17 at 07:47