I ran into a perplexing blocker at our main site this week, and I'd like to understand the root cause. My troubleshooting process broke down, and a configuration change that I don't understand ended up fixing the issue.
Sorry for the wall of text, but if anyone could help me, I figure the more details the better.
TL;DR: we had external DNS lookups fail, from domain computers and our domain controller, until I adjusted the DNS forwarder resolution timeout from the default 3 seconds to 10. Lookups were failing in NSLookup even when I manually specified an external server. Details below.
On Friday, I swapped the addresses of our two new domain controller/DNS/DHCP servers with the addresses of the two old DNS/DHCP/AD controllers. The old servers were Windows 2008 R2, the new 2012 R2 with SP1, fully patched with the latest Windows Updates. I also backed up and restored the DHCP settings (with Windows PowerShell) to the new DHCP servers and set up a failover relationship with the new Windows DHCP load balancing.
I restarted, ran dcdiag /fix, and verified DHCP and DNS lookups. Everything appeared to be working fine. I checked the next few days to make sure that the DHCP server transfer worked correctly and that machines were getting the appropriate leases. I don't recall completely if the DNS lookups I tried included an external site or not, but I think I did try nslookup www.google.com and it worked on Friday right after the change.
On Sunday, a user first reported a problem accessing the network. By Monday, it was a widespread issue. Users couldn't access the Internet, or could access it only intermittently. Also, our IP phones were staticky, similar to what we've seen before with significant congestion. VoIP phones are on a separate VLAN and don't connect to our DNS servers or firewall, except to look up user directory entries. Other than the phone symptoms, the problem seemed ONLY to be DNS related. We run a split-brain DNS server, so our externally hosted sites that had a local DNS entry worked fine and loaded quickly. I also could SSH to an external server. There was no packet loss between any of the servers locally, and our VPN tunnel to a remote site was working fine. Over the VPN to our remote site, I could access the Internet fine--the DNS server at the remote site is also a domain controller running 2012 R2.
Using DNS lookup on a workstation and on a DC, I tried various lookups. Internal lookups were great, external lookups failed. On both of the new DNS servers, the external DNS forwarders were set to our ISP and Google public DNS. DNS root hints were also enabled. I specified the external DNS server by opening NSLookup, typing server 8.8.8.8, and then trying a few websites--www.google.com, www.microsoft.com. They failed, reporting 4 DNS server timeouts. I also tried manually setting the DNS server to my known working DC, the one in the remote site, without success.
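A minimal Python sketch of the same direct-to-server test, bypassing both nslookup and the local resolver, might look like the following. It assumes plain UDP on port 53 and a single A-record query. The function names `build_query` and `probe` and the 2-second default timeout are illustrative choices, not anything taken from the setup described here.

```python
import socket
import struct
import time


def build_query(name, txid=0x1234):
    """Build a minimal DNS query for an A record with recursion requested."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, name, timeout=2.0, txid=0x1234):
    """Send one query straight to `server`.

    Returns (rcode, seconds) if a matching reply arrives, or None if the
    wait times out, the send fails, or the reply does not match the query.
    """
    packet = build_query(name, txid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            # socket.timeout is a subclass of OSError, so this covers both
            # "no reply in time" and lower-level send/receive errors.
            return None
        elapsed = time.monotonic() - start
    if len(reply) < 12:
        return None
    reply_id, flags = struct.unpack("!HH", reply[:4])
    if reply_id != txid:
        return None
    # The low four bits of the flags field carry the response code (0 = no error)
    return flags & 0x000F, elapsed
```

Calling `probe("8.8.8.8", "www.google.com", timeout=10.0)` from both a workstation and a DC would show whether replies arrive at all and, if they do, how long they take. A query sent this way goes directly to the named server, so the forwarder settings on the DCs should play no part in the result.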
I ran similar tests on the old DCs, which now had the new IP addresses, and saw the same failures. I confirmed that each local DC's DNS client settings pointed to a different DC as the primary DNS server, as recommended on Microsoft TechNet. Finally, to narrow down the problem, I shut down the old DCs.
I first suspected the firewall. We have a firewall rule that allows DNS UDP and TCP on port 53 from any/to any. IPS was turned on for this rule. I did see an alert about a UDP flood from one DC, as well as some blocked BootP packets from the domain controller. I turned off IPS for that rule, and the alert went away without fixing the symptom. However, in the past I've seen some odd behavior with our firewall after it had been running for a long time, so I tried restarting it, with no change in behavior. I also swapped the cable from the firewall to our switch and restarted the modem to our primary ISP.
Research led me to notes about EDNS causing problems with some firewalls. I turned off EDNS on both of the new DCs by typing in
dnscmd /config /EnableEDNSProbes 0
on both domain controllers and restarting the DNS server service. No change in behavior.
I tried adding and changing the order of the DNS forwarders on both domain controllers, with no change in behavior.
I tried changing the ISP I was using--we have 2 separate ISPs. In the firewall, I changed the precedence order, and verified with a tracert that I was connected to the right ISP. I tried the same nslookups, manually specifying the DNS server as 8.8.8.8, and got the same failures on both my local workstation and the Domain Controller. Intermittently, I got a successful lookup on the domain controller, but not consistently.
Packet captures on the firewall were showing queries making it to the firewall, but I wasn't seeing responses from the external DNS server.
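To put numbers on what a capture like this shows, queries can be paired with responses by DNS transaction ID. The sketch below assumes the capture has already been exported into simple records. The `Packet` shape, the `match_queries` name, and the sample values are all hypothetical and are not taken from the firewall in question.

```python
from collections import namedtuple

# Hypothetical record shape; a real capture export would need mapping onto it.
Packet = namedtuple("Packet", "time direction txid name")


def match_queries(packets, wait=3.0):
    """Pair queries with responses by transaction ID.

    Returns (missing, late): queries that never got a response, and
    (query, delay) pairs whose response took longer than `wait` seconds.
    """
    pending = {}
    late = []
    for p in sorted(packets, key=lambda p: p.time):
        if p.direction == "query":
            # Keep the earliest query per ID so retransmits do not reset the clock
            pending.setdefault(p.txid, p)
        elif p.direction == "response":
            q = pending.pop(p.txid, None)
            if q is not None and p.time - q.time > wait:
                late.append((q, p.time - q.time))
    missing = sorted(pending.values(), key=lambda p: p.time)
    return missing, late


# Invented sample data, for illustration only
sample = [
    Packet(0.00, "query", 1, "www.google.com"),
    Packet(0.05, "response", 1, "www.google.com"),
    Packet(1.00, "query", 2, "www.microsoft.com"),
    Packet(2.00, "query", 3, "example.com"),
    Packet(6.50, "response", 3, "example.com"),
]
missing, late = match_queries(sample, wait=3.0)
print([p.name for p in missing])
print([(q.name, delay) for q, delay in late])
```

Separating "never answered" from "answered late" matters here, because only the second case could be affected by a longer timeout.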
I also tried dcdiag /testdns and it reported that all of the forward servers failed, and all of the root hints failed, on both domain controllers.
Finally, as a last-ditch effort that I didn't expect to help, I increased the DNS forwarder lookup timeout from the default 3 seconds to 10 seconds on both domain controllers, by going to Domain Controller Properties | Forwarders | Edit, and updating "number of seconds before forward queries timeout" to 10. Instantly, everything started to work. The only other fairly recent change: a few minutes earlier, I had turned the old DCs back on.
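One way to sanity-check whether a longer timeout could plausibly matter is to compare how many upstream answers arrive within 3 seconds versus within 10. The sketch below assumes a list of measured reply times, with `None` standing for a query that was never answered. The `success_rate` name and the sample latencies are invented for illustration and are not measurements from this network.

```python
def success_rate(latencies, timeout):
    """Fraction of samples answered within `timeout` seconds.

    `latencies` holds reply times in seconds; None means no reply at all.
    """
    if not latencies:
        return 0.0
    answered = sum(1 for s in latencies if s is not None and s <= timeout)
    return answered / len(latencies)


# Invented sample: a mix of fast, slow, and missing replies
samples = [0.04, 4.2, None, 5.1, 0.06, 7.8, None, 3.6]

print(success_rate(samples, 3))   # share answered under the 3-second default
print(success_rate(samples, 10))  # share answered under a 10-second timeout
```

If real measurements showed most replies taking between 3 and 10 seconds, the timeout change would be a believable fix. If replies were either fast or missing entirely, it would look more like coincidence.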
- Why did changing this timeout fix anything? Could it have?
- Why did manually specifying the server with nslookup NOT actually appear to test connectivity to the external server? Why would the DNS forward query timeout on the domain controller have anything to do with my Windows desktop client? Am I misunderstanding this option in the Windows 7 and Server 2012 R2 versions of nslookup?
- Did this actually fix the problem, or was it coincidence? It was an instant resolution, so I'm inclined to believe it was the fix. I had a browser open and was refreshing it--as soon as I made the configuration change, it worked.
- Should I be looking for a failure somewhere else, such as on my switch? Is it possible that some table filled up from the stored-up DNS queries? The switch is the common factor between phones/Internet, but the narrow problem that showed up as DNS lookup failures and otherwise fine network access internally makes me discount this.
One further detail: when I open the DNS server properties now, under Forwarder, it shows the timeout setting as "3 seconds" again, although I never changed it back.
This all really has me doubting my ability to diagnose this kind of issue in the future. I knew to check the DNS servers because of the recent change, but I really don't understand why my NSLookup queries against the manually specified DNS server were timing out. If the timeout change really was the fix, those failures led me down the wrong troubleshooting path. I expected that when my nslookup queries with the manually specified server failed, it had to be a firewall issue, not a configuration issue on the domain controller.