
So... We have a few servers on which we would like load-balanced link aggregation across multiple network interfaces. In the past we have used Broadcom BASP teaming with LACP.

Today we encountered this problem: the 'A' record for one of our DCs disappeared and would not recreate. It coincided with configuring a different DC with an LACP NIC team.

We have broken the team now and things seem normal again, but what is actually happening here? Is LACP a bad idea for Windows servers, full stop? Or is it just DCs?

Edit: It looks like Server 2012 implements LACP itself. I should make it clear that I am using 2008 R2.

Matt
  • Is that DC a DNS server? – Shane Madden Nov 28 '13 at 01:42
  • It shouldn't break the A record. I wonder if it's related to which MAC address gets used and a silly ARP cache issue on the DC. – hookenz Nov 28 '13 at 01:43
  • @ShaneMadden: The DC that LACP was configured on was not a dns server. The host for the A record that was lost is a DC with dns. – Matt Nov 28 '13 at 02:03
  • @Matt: An interesting thought but in an LACP group isn't a single MAC used? – Matt Nov 28 '13 at 02:04
  • @Ablue A DC that runs DNS has a bit of a novel method of putting its `A` records in DNS - the DNS server process adds an `A` record directly for each listening interface configured for the DNS service. Would any changes you were making at the time have affected the DNS service on that system? – Shane Madden Nov 28 '13 at 02:09
  • @ShaneMadden: There were no changes on any server running dns, but the zone affected was an AD integrated zone. Could this be a factor? – Matt Nov 28 '13 at 02:11
  • @Ablue That shouldn't affect anything.. is scavenging set up for the zone on any of the DNS servers? – Shane Madden Nov 28 '13 at 02:22
  • @Ablue Do multiple `A` records with the same FQDN/label as the one that disappeared exist in the zone? – Mathias R. Jessen Nov 28 '13 at 02:24
  • @ShaneMadden: No these records have no scavenging set. – Matt Nov 28 '13 at 04:01
  • @MathiasR.Jessen: No – Matt Nov 28 '13 at 04:02
  • @Ablue Did the record actually get deleted, from all DNS servers that have the zone? Or did it just not resolve from some places? I'm having a hard time imagining how it might have gotten deleted. – Shane Madden Nov 28 '13 at 04:11
  • @ShaneMadden: Yes, it was not present in the snap-in and nslookup said it didn't exist either. I couldn't believe such a problem could exist. I was very surprised. – Matt Nov 28 '13 at 05:35
  • @Ablue Interesting. Do you by any chance have the `Active Directory Changes` audit subcategory set to log successful changes on your DCs? I'm curious what happened to that entry's object. – Shane Madden Nov 28 '13 at 05:36
  • @ShaneMadden: I don't. I am hesitant to try recreating the problem to gather information. – Matt Nov 28 '13 at 05:39
  • @Ablue Heh, fair enough. – Shane Madden Nov 28 '13 at 05:40
  • @ShaneMadden: Can I ask what your experience with link aggregation in Windows Server has been? Do you use it? Do you do something different? Do you have any suggestions for safely load balancing network links in Windows? – Matt Nov 28 '13 at 05:43
  • @Ablue My experience with it is pretty limited, but, not good. The vendor-provided aggregation setups and their wacky virtual interfaces have caused me nothing but problems. I haven't actually messed with the implementation in 2012, either, but I'd avoid aggregation until after upgrading to 2012. – Shane Madden Nov 28 '13 at 05:45
  • @ShaneMadden: I was really hoping that wasn't my only option. – Matt Nov 28 '13 at 05:56
  • @Ablue Well, what's the need for LACP? Is it for redundancy or added bandwidth? If it's redundancy, then I'd say get application redundancy in place in other ways - AD domains are very good about dealing with a domain controller being down. – Shane Madden Nov 28 '13 at 06:01
  • @ShaneMadden I am not really concerned about DCs having LACP. It was other 2008 R2 servers, like a file server or a CAS. This fault has made me question my assumptions about LACP and Windows. In those situations I would really like additional bandwidth, but full LBFO would be ideal. – Matt Nov 28 '13 at 06:05
  • @ShaneMadden The built-in Windows 'network bridge' creates a virtual interface that uses spanning tree to run an active/standby HA setup for network interfaces. – Matt Nov 28 '13 at 06:10
  • Eek, I wouldn't trust a bridge to speak spanning tree properly to the switches and not create a loop. The approach that I've always taken would be to either cluster or DFS the file server, and stick multiple CAS servers behind reverse proxies or load balancers that do health checks. Because NIC redundancy has been unsupported by MS for so long, I don't think it's very common to use LACP on physical Windows servers pre-2012. – Shane Madden Nov 28 '13 at 06:16
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/11722/discussion-between-ablue-and-shane-madden) – Matt Nov 28 '13 at 06:19
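
For anyone wanting to check the point raised in the comments (whether the `A` record is really gone from every DNS server that holds the zone, or just fails to resolve from some places), here is a minimal sketch that asks each server directly instead of going through the local resolver. It uses only the Python standard library. The host name `dc1.example.com` and the server addresses are hypothetical placeholders, not values from the question, and the sketch only reads the response header (response code and answer count), so treat it as a quick sanity check rather than a full DNS client.

```python
import socket
import struct


def build_a_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def parse_status(response):
    """Return (rcode, answer_count) from the 12-byte DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount


def check_servers(name, servers, timeout=2.0):
    """Query each server over UDP port 53 and collect its header status."""
    results = {}
    for server in servers:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_a_query(name), (server, 53))
                data, _ = sock.recvfrom(512)
                results[server] = parse_status(data)
            except OSError as exc:  # includes timeouts
                results[server] = exc
    return results


if __name__ == "__main__":
    # Hypothetical name and DNS server addresses; replace with your own.
    for server, status in check_servers(
        "dc1.example.com", ["192.0.2.10", "192.0.2.11"]
    ).items():
        print(server, status)
```

An rcode of 3 (NXDOMAIN) from every server would suggest the record really was deleted from the zone; rcode 0 with a non-zero answer count on some servers but not others would point at replication or caching instead.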

2 Answers


Teaming with Broadcom or Intel is very widely used and, in my experience, has caused no problems. I am not saying that a particular driver could not be buggy, but jumping to the conclusion that LACP is flaky in general looks like a red herring. I would suspect that a misconfiguration while setting it up is far more likely.

JamesRyan
  • Same here, I've been using Broadcom and Intel teaming for many years on physical domain controllers (2003 R2 and 2008 R2) without problems. – pauska Nov 28 '13 at 10:51
  • So as far as you are concerned, the causes for the described issue are limited to a bad driver or misconfiguration? – Matt Nov 28 '13 at 22:39
  • Seen this one? It might be to do with service start order: http://serverfault.com/questions/120052/windows-2003-domain-controller-very-upset-about-nic-teaming?rq=1 – JamesRyan Nov 29 '13 at 11:05

It looks like using third-party link aggregation software on Windows Server (2008 R2, at least) is probably not a good idea.

Possible solutions include:

  • Upgrade to 2012
  • P2V servers
  • Cluster instead
  • Commit seppuku
Matt