This is a rather complex problem, but I'll try to make it easy to understand:
I have three subnets. We'll call them 10.10.0.0/22, 10.20.0.0/22 and 10.30.0.0/16. I have two AD domains, but I don't think that is significant.
The 10.30 subnet is where most of the machines on my network live. The 10.10 and 10.20 subnets are dedicated to heavy traffic between servers (storage and virtual machine migration).
The domain controllers (which are also DNS servers) have interfaces on all three subnets, so they can authenticate machines on all three subnets. Because of that, we have enabled subnet prioritization on the DNS servers, meaning that they attempt to order the DNS results with a preference for the client's subnet. For example, if I make a request from the 10.30 subnet for server.company.com (which also has interfaces on all three subnets), the DNS server returns all three IP addresses for that machine, in the order 10.30.0.5, 10.20.0.5, 10.10.0.5. (The last two may be reversed.)
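To make the ordering I expect concrete, here is a minimal Python sketch of subnet prioritization as I understand it. It is only an illustration: the function name `prioritize_by_subnet`, the client address 10.30.0.77, and the /16 mask are assumptions for the example, not the DNS server's actual implementation.

```python
import ipaddress

def prioritize_by_subnet(client_ip, client_prefix_len, addresses):
    """Order addresses so that those in the client's subnet come first.

    Illustrative only; the real DNS server logic may differ.
    """
    client_net = ipaddress.ip_network(
        f"{client_ip}/{client_prefix_len}", strict=False
    )
    # sorted() is stable, so addresses outside the client's subnet
    # keep their original relative order.
    return sorted(
        addresses,
        key=lambda a: ipaddress.ip_address(a) not in client_net,
    )

# Hypothetical client on the 10.30.0.0/16 subnet.
records = ["10.10.0.5", "10.20.0.5", "10.30.0.5"]
ordered = prioritize_by_subnet("10.30.0.77", 16, records)
print(ordered)  # ['10.30.0.5', '10.10.0.5', '10.20.0.5']
```

This is the order I see on the wire; the open question is why the client does not use the first entry.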
Everything in DNS seems to be working as expected. My workstation is only on the 10.30 subnet. However, when I ping server.company.com, it always resolves to either 10.20.0.5 or 10.10.0.5, and never to 10.30.0.5. I ran a Wireshark capture of the DNS traffic, and the DNS servers are definitely returning the results in the correct order. However, my client is ignoring the 10.30 entry entirely. It always resolves to 10.20 or 10.10, depending on which is next in the DNS reply. nslookup queries always look correct, but nslookup doesn't actually resolve the name the way an application would; it only shows the DNS server's answer, which, as far as I can see, is correct.
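One way to see what the operating system's resolver hands to applications, as opposed to what the DNS server sent, is to call `getaddrinfo` directly. The sketch below uses Python's `socket.getaddrinfo`, which goes through the platform resolver. The helper name `resolved_ipv4_order` is made up for this example, and whether ping takes exactly the same path is an assumption on my part.

```python
import socket

def resolved_ipv4_order(hostname):
    """Return IPv4 addresses in the order the OS resolver returns them."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    ordered = []
    for info in infos:
        address = info[4][0]  # sockaddr is (host, port) for AF_INET
        if address not in ordered:
            ordered.append(address)
    return ordered

# Example (run on the affected workstation):
# print(resolved_ipv4_order("server.company.com"))
```

If the list printed here differs from the order in the Wireshark capture, the reordering is happening on the client after the DNS reply arrives.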
I'm running Windows 10 with all available Windows Updates (no Insider components). I've confirmed the same behavior on at least four other machines running Windows 10, but some other Windows 10 machines work correctly. I have tested on Windows 8, Windows 7, all recent versions of Windows Server, and several Linux machines, and all of them resolve correctly. It has only come up on some Windows 10 machines, although I can't conclusively rule out the possibility that it's happening elsewhere.
Here's where it gets weird: I can edit the HOSTS file on my machine, and if the 10.30 entry is the only entry for the server, it resolves correctly. But if there are any other entries for the server, it will choose one of those instead. It doesn't make a difference if the 10.30 entry is first in the list.
And then it gets really weird:
I can RDP to server.company.com just fine, 100% of the time. I open a cmd window and ping or tracert server.company.com, and it resolves to 10.10.0.5. I type mstsc -v:10.10.0.5 and it times out, so the server is not reachable at that address (and it should not be). However, mstsc -v:10.30.0.5 and mstsc -v:server.company.com both work, meaning that ping doesn't seem to be using the same resolution mechanism as the Microsoft RDP client. I'm not clearing my DNS cache between these tests, either. In fact, the entries are listed in the correct order when I type ipconfig /displaydns. Server Manager (RSAT) can seemingly manage the server from my workstation, but Hyper-V Manager cannot. RPC appears to work, but I get strange authentication errors with some functions. For some reason, some applications are just completely skipping over the 10.30 subnet.
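The mstsc tests above amount to checking which of the three addresses accepts a TCP connection from my workstation. A rough Python sketch of that check follows. The helper names are invented for the example, and it assumes RDP is listening on its default port, 3389.

```python
import socket

def tcp_reachable(address, port=3389, timeout=3.0):
    """Return True if a TCP connection to address:port succeeds in time."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False

def reachability_report(addresses, port=3389, timeout=3.0):
    """Map each address to whether it accepted a TCP connection."""
    return {a: tcp_reachable(a, port, timeout) for a in addresses}

# Example with the addresses from this question:
# print(reachability_report(["10.30.0.5", "10.20.0.5", "10.10.0.5"]))
```

From my workstation I would expect only 10.30.0.5 to come back as reachable, which matches what mstsc shows.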
Could there be something in Group Policy that tells Windows to de-prioritize certain subnets? All subnets are listed in Active Directory Sites and Services, and I don't think I've done anything fancy there. (There is only one site.) Is there anything else that might cause name resolution to skip a subnet for some reason?
EDIT: I greatly appreciate the advice against multi-homing DCs in Active Directory. However, that was done out of necessity, and I don't believe it is a factor in the problem. Wireshark traces show conclusively that the DNS list is coming back in the correct order, with the local subnet as the first entry. However, Windows (and, it looks like, specifically ICMP) is choosing to ignore that entry and use the second one that is returned. (Unless Windows is using some other means of address resolution, and before you ask, NetBIOS is disabled. :) )