4

This is a rather complex problem, but I'll try to make it easy to understand:

I have three subnets. We'll call them 10.10.0.0/22, 10.20.0.0/22 and 10.30.0.0/16. I have two AD domains, but I don't think that is significant.

The 10.30 subnet is where most of the machines on my network live. The 10.10 and 10.20 subnets are dedicated to heavy traffic between servers (storage and virtual machine migration).

The domain controllers (which are also the DNS servers) have interfaces on all three subnets, so they can authenticate machines on all three subnets. We have therefore enabled subnet prioritization on the DNS servers, meaning that they attempt to order the DNS results with a preference for the client's subnet. For example, if I make a request from the 10.30 subnet for server.company.com (which also has interfaces on all three subnets), the DNS server returns all three IP addresses for that machine, in the order 10.30.0.5, 10.20.0.5, 10.10.0.5. (The last two may be reversed.)
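To make the expected ordering concrete, here is a minimal sketch of subnet prioritization in Python. It is only an illustration of the idea, not how the Windows DNS server implements it; the function name `prioritize` and the `prefix_len` parameter are hypothetical and chosen for this example.

```python
import ipaddress

def prioritize(client_ip, records, prefix_len=16):
    """Order A records so those on the client's network come first.

    prefix_len is an assumed mask length for this sketch; a real DNS
    server takes it from its own configuration.
    """
    client_net = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    # sorted() is stable, so records keep their original relative order
    # within the "same network" and "other network" groups.
    return sorted(records, key=lambda r: ipaddress.ip_address(r) not in client_net)

records = ["10.10.0.5", "10.20.0.5", "10.30.0.5"]
print(prioritize("10.30.0.77", records))
```

For a client at 10.30.0.77, the 10.30.0.5 record moves to the front and the other two keep their relative order, which matches the behavior described above.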

Everything in DNS seems to be working as expected. My workstation is only on the 10.30 subnet. However, when I ping server.company.com, it always resolves to either 10.20.0.5 or 10.10.0.5, and never to 10.30.0.5. I ran a Wireshark capture of the DNS traffic, and the DNS servers are definitely returning the results in the correct order. However, my client is ignoring the 10.30 entry entirely. It always resolves to 10.20 or 10.10, depending on which is next in the DNS reply. nslookup queries always seem to look correct, but nslookup doesn't actually resolve the query, it only provides the DNS server's answer, which, as far as I can see, is correct.
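One way to see what applications actually receive, as opposed to what nslookup prints, is to ask the operating system's resolver directly. The sketch below assumes Python is available on the workstation; `resolver_order` is a hypothetical helper name. It uses the standard `socket.getaddrinfo` call, which goes through the OS resolver (including the HOSTS file and any client-side sorting) rather than querying the DNS server itself.

```python
import socket

def resolver_order(hostname):
    """Return IPv4 addresses in the order the OS resolver hands them to applications."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return [info[4][0] for info in infos]
```

Calling `resolver_order("server.company.com")` on an affected and an unaffected machine and comparing the lists with the order seen in Wireshark should show whether the reordering happens on the client.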

I'm running Windows 10 with all available Windows Updates (no Insider components). I've confirmed this same behavior on at least four other machines that are running Windows 10, but some other machines running Windows 10 work correctly. I have tested on Windows 8, Windows 7, all recent versions of Windows Server, and several Linux machines, and all resolve correctly. It has only come up on some Windows 10 machines, although I can't conclusively rule out the possibility that it's happening elsewhere.

Here's where it gets weird: I can edit the HOSTS file on my machine, and if the 10.30 entry is the only entry for the server, it resolves correctly. But if there are any other entries, it will choose one of those instead. It doesn't make a difference if the 10.30 entry is first on the list.

And then it gets really weird: I can RDP to server.company.com just fine, 100% of the time. I open a cmd window and ping or tracert server.company.com, and it resolves to 10.10.0.5. I type mstsc -v:10.10.0.5 and it times out, so the server is not reachable at that address (and it should not be). However, mstsc -v:10.30.0.5 and mstsc -v:server.company.com actually work, meaning that ping doesn't seem to be using the same resolution mechanism as the Microsoft RDP client. I don't clear my DNS cache between tests, either. In fact, the entries are listed in the correct order when I type ipconfig /displaydns. Server Manager (RSAT) can seemingly manage the server from my workstation, but Hyper-V Manager cannot. RPC appears to work, but I get strange authentication errors with some functions. For some reason, some applications are just completely skipping over the 10.30 subnet.

Could there be something in Group Policy that tells Windows to de-prioritize certain subnets? All subnets are listed in Active Directory Sites and Services, and I don't think I've done anything fancy there. (There is only one site.) Is there anything else that might cause name resolution to skip a subnet for some reason?

EDIT: I greatly appreciate the advice against multi-homing DCs in Active Directory; however, that was done out of necessity, and I don't believe it is a factor in the problem. Wireshark traces show conclusively that the DNS list is coming back in the correct order, with the local subnet as the first entry. However, Windows (and, it looks like, specifically ICMP) is choosing to ignore that entry and use the second one that is returned. (Unless Windows is using some other means of address resolution, and before you ask, NetBIOS is disabled. :) )

C Hamm
  • You've stated multiple times that DNS resolution is working correctly and that the A records are returned in the correct order so stop focusing on name resolution. Focus on the ping. Ping selects the wrong ip address. All other applications appear to work correctly and connect to the correct ip address for the A record dependent upon what subnet the client is in. What is it about ping that makes it behave differently? Do any other applications exhibit the same behavior as ping? – joeqwerty Jan 27 '16 at 23:11
  • Thanks - I troubleshot this for hours and hours as a DNS issue, and seem to have ruled that out, but I'll edit the question a bit later to put a sharper point on the question. – C Hamm Jan 28 '16 at 00:25
  • Ping is not the only application affected. On the surface, tracert is also resolving the same way, and there are several other applications that aren't working correctly, all of which are explained by the application using the incorrect address. However, I'll need to do some testing (running Wireshark) to test if, for example, a web browser will send traffic to the correct address. Good thing to test, and thank you for the direction! – C Hamm Jan 28 '16 at 00:29
  • As a side note, ping and tracert are essentially the same thing in this scenario. They're both using ICMP. So is there something with ICMP that causes it to behave in this manner? – joeqwerty Jan 28 '16 at 00:36
  • Also, to be clear, I think that name resolution IS to some degree the problem, just not DNS, and oddly, not across the network stack. The behavior is consistently wrong, whether it resolves using the HOSTS file, or the data returned by DNS. – C Hamm Jan 28 '16 at 00:37
  • It's possible - is it likely that Hyper-V Manager uses ICMP when determining where to send traffic? That application returns a "server not responding" type message. I think maybe a good test would be to use some applications other than ping to directly resolve the IP address and see what I get. – C Hamm Jan 28 '16 at 00:41
  • You made a design error at the start: multihoming a DC is an error. Is there no routing between the VLANs? – yagmoth555 Jan 28 '16 at 00:41
  • Anyhow, google multihoming; you will find plenty of users with symptoms like yours, as the DC's NICs register all three IPs in DNS. I will write an answer later; I'm on my phone at the moment. – yagmoth555 Jan 28 '16 at 00:53
  • I installed a 3rd party IP toolkit, and it looks like it has cast more light - using an application like finger or IP lookup, it produces the correct address, but using tracert within the same application produces the wrong address. It looks likely that it is tied to ICMP. – C Hamm Jan 28 '16 at 00:55
  • I'll need to look up the reason to be sure, but there was a design reason we needed to multihome the servers. I believe it had something to do with the cluster needing to be able to reach DNS servers across two separate subnets for fault tolerance. It initially failed when validating the cluster, which is why we added interfaces on the cluster subnets. – C Hamm Jan 28 '16 at 01:00
  • It sounds like you could get **much** better results by ensuring you have a layer-3-capable switch, and using that to route between your vlans. This would allow you to avoid multi-homing anything. Multihoming is **definitely** a factor in your current problem. – Joel Coel Feb 16 '16 at 19:35

4 Answers

3

If you really need to multihome your DC, please follow these steps. I took them from the linked document. It links to old KB articles, but it is an updated document by a known blogger.

The following are the manual steps to configure a Multihomed DC

  1. Ensure that all the NICs point only to your internal DNS server(s) and to no others, such as your ISP’s DNS servers’ IP addresses.

  2. In Network & Dialup properties, Advanced Menu item, Advanced Settings, move the internal NIC (the network that AD is on) to the top of the binding order (top of the list).

  3. Disable the ability for the outer NIC to register. The procedure, as mentioned, involves identifying the outer NIC’s GUID number. The following link will show you how:

246804 – How to Enable-Disable Windows 2000 Dynamic DNS Registrations (per NIC too): http://support.microsoft.com/?id=246804

  4. Disable NetBIOS on the outside NIC. That is performed by choosing to disable NetBIOS in IP Properties, Advanced, and you will find that under the “WINS” tab.

You may want to look at step #3 in the following article to show you how to disable NetBIOS on the RRAS interfaces if this is a RRAS server.

Chapter 11 – NetBIOS over TCP/IP http://technet.microsoft.com/en-us/library/bb727013.aspx

Or enable/disable NetBIOS on an interface in the registry:

To do it in the registry, you will need to identify the GUID of that interface (this may not apply to PPP interfaces). Under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NetBT\Parameters\Interfaces, find the GUID(s) with NetbiosOptions set to 0 and set them to 2.

Using WMIC:

First, get the list of interfaces: wmic nicconfig get caption,index,TcpipNetbiosOptions

Then use the “index number” in the next command: wmic nicconfig where index=1 call SetTcpipNetbios 2

SetTcpipNetbios options are:

0 – Use NetBIOS setting from the DHCP server
1 – Enable NetBIOS over TCP/IP
2 – Disable NetBIOS over TCP/IP

More info on the wmic commands and the registry entries can be found in this forum thread link:

Thread – Configuring NetBIOS over TCP/IP http://social.technet.microsoft.com/Forums/en-US/winservercore/thread/d18bd172-e1a0-4a61-ba52-0952a1e3cabc/

Configure TCP/IP to use WINS http://technet.microsoft.com/en-us/library/cc757386(WS.10).aspx

Note: A standard Windows service, called the “Browser service”, provides the list of machines, workgroup and domain names that you see in “My Network Places” (or the legacy term “Network Neighborhood”). The Browser service relies on the NetBIOS service. One major requirement of NetBIOS service is a machine can only have one name to one IP address. It’s sort of a fingerprint. You can’t have two brothers named Darrell. A multihomed machine will cause duplicate name errors on itself because Windows sees itself with the same name in the Browse List (My Network Places), but with different IPs. You can only have one, hence the error generated.

  5. Disable the “File and Print Service” and disable the “MS Client Service” on the outer NIC. That is done in NIC properties by unchecking the respective service under the general properties page. If you need these services on the outside NIC (which is unlikely), which allow other machines to connect to your machine for accessing resources on your machine (shared folders, printers, etc.), then you will probably need to keep them enabled.

  6. Uncheck “Register this connection” under IP properties, Advanced settings, “DNS” tab.

  7. Delete the outer NIC IP address, disable Netlogon registration, and manually create the required records:

    a. In DNS under the zone name, (your DNS domain name), delete the outer NIC’s IP references for the “LdapIpAddress”.

    b. If this is a GC, you will need to delete the GC IP record as well (the “GcIpAddress”). To do that, in the DNS console, under the zone name, you will see the _msdcs folder. Under the _msdcs folder, you will see the _gc folder. To the right, you will see the IP address referencing the GC address. That is called the GcIpAddress. Delete the IP addresses referencing the outer NIC.

        1. To stop these two records from registering that information, use the steps provided in the link below:

            Private Network Interfaces on a Domain Controller Are Registered in DNS
            http://support.microsoft.com/?id=295328

        2. The section of the article that disables these records uses this registry entry:

            HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters
            (Create this Multi-String Value under it):
            Registry value: DnsAvoidRegisterRecords
            Data type: REG_MULTI_SZ
            Values: LdapIpAddress
                    GcIpAddress

            The following link provides more information on the LdapIpAddress and GcIpAddress, as well as other Netlogon Service records:

            Restrict the DNS SRV resource records updated by the Netlogon service [including GC]:
            http://technet.microsoft.com/en-us/library/cc778029(WS.10).aspx

        3. Then you will need to manually create the LdapIpAddress and GcIpAddress records in DNS with the IP addresses that you need for the DC. To create the LdapIpAddress, manually create a new host under the domain, but leave the “hostname” field blank, and provide the internal IP of the DC, which results in a record that looks like:

            (same as parent) A 192.168.5.200 (192.168.5.200 is used for this example)

        4. You also need to manually create the GcIpAddress, if this is a GC. That would be under the _msdcs._gc SRV record under the zone. It is created in the same fashion as the LdapIpAddress mentioned above.
    
  8. In the DNS console, right-click the server name, choose Properties, then under the “Interfaces” tab, force it to listen only on the internal NIC’s IP address, and not on the IP address of the outer NIC.

  9. Since this is also a DNS server, the IPs from all NICs will register, even if you tell it not to in the NIC properties. See the following article for how to stop that behavior (this procedure is for Windows 2000, but will also work for Windows 2003):

275554 – The Host’s A Record Is Registered in DNS After You Choose Not to Register the Connection’s Address: http://support.microsoft.com/?id=275554

  10. Disable the round robin functionality on the DNS server (this step was added 5/2010). To do so:

    1. Click Start, click Settings, click Administrative Tools, and then click DNS.
    2. Open the properties for the DNS server’s name.

  11. If you haven’t done so, configure a forwarder. If you are not sure which DNS servers to forward to, you can use 4.2.2.2 and 4.2.2.3 until you have the DNS address of your ISP. How to set a forwarder? Good question. Choose one of the following articles, depending on your operating system.

300202 – HOW TO: Configure DNS for Internet Access in Windows 2000 http://support.microsoft.com/?id=300202

323380 – HOW TO: Configure DNS for Internet Access in Windows Server 2003 (How to configure a forwarder): http://support.microsoft.com/d/id?=323380

Configure a DNS Server to Use Forwarders – Windows 2008 and 2008 R2 http://technet.microsoft.com/en-us/library/cc754941.aspx

Active Directory and NAT

I thought to touch base on this overlooked fact about AD communication through a NAT.

If a planned resource in the AD infrastructure uses AD authentication (Kerberos) and must traverse a NAT, it basically won’t work. This is due to secure RPC communications and NAT not being able to translate the traffic because of the encryption. If you really need to make it work, there are solutions to work around it, such as a direct VPN between the services across the NAT devices, or additional NICs directly connecting them. More on it in this link, and Microsoft’s take and solution on it:

Description of support boundaries for Active Directory over NAT http://support.microsoft.com/default.aspx?scid=kb;en-us;978772&sd=rss&spid=12925

Active Directory communication fails on multihomed domain controllers http://support.microsoft.com/kb/272294

Source IP address selection on a Multi-Homed Windows Computer

There is often confusion about how a computer chooses which adapter to use when sending traffic. This blog describes the process by which a network adapter is chosen for an outbound connection on a multiple-homed computer, and how a local source IP address is chosen for that connection.

Source IP address selection on a Multi-Homed Windows Computer http://blogs.technet.com/b/networking/archive/2009/04/24/source-ip-address-selection-on-a-multi-homed-windows-computer.aspx

yagmoth555
  • Thank you! This is an incredible resource, and if you haven't already, I hope it is published somewhere easy to find. I'll go through each point with a fine-tooth comb, but I believe that we have done all, or at least most of these things. I believe the DNS is working correctly, but I could see one of the minor items, like something related to Kerberos, causing some of the weird problems I'm seeing. – C Hamm Jan 28 '16 at 15:46
0

According to the article "Client-Side DNS Prioritisation in Windows 10":

Unfortunately, there is a catch; as of today, the implementation of the RFC in Windows 10, or at least the implementation of rule 9, is fundamentally broken. As indicated, rule 9 compares the client IP address with each address retrieved from a DNS query, identifying the value with the longest matching prefix. This comparison is based on the IPv6 translation of an IPv4 address (even if IPv6 is not enabled), but unfortunately, instead of basing the comparison on the length of an IPv6 address, the comparison incorrectly uses the length of an IPv4 address (i.e. only a part of the translated IPv6 address is actually used in the comparison). The product team at Microsoft have confirmed the bug, and have developed a private hotfix, which should be publicly available in the next few weeks.
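To illustrate one possible reading of that description, here is a rough Python sketch of a longest-matching-prefix comparison (rule 9 of destination address selection, as in RFC 6724) on IPv4-mapped IPv6 addresses. It is not Microsoft's code; the helper names and the `compare_bits` parameter are hypothetical, and the "truncated" case simply assumes that only the first 32 bits of the 128-bit mapped address are compared.

```python
import ipaddress

def mapped_bits(ipv4):
    """128-bit integer for the IPv4-mapped IPv6 form ::ffff:a.b.c.d."""
    return (0xFFFF << 32) | int(ipaddress.IPv4Address(ipv4))

def common_prefix_len(a, b, compare_bits=128):
    """Count matching leading bits, looking only at the first compare_bits bits."""
    diff = (mapped_bits(a) ^ mapped_bits(b)) >> (128 - compare_bits)
    return compare_bits - diff.bit_length()

client = "10.30.0.77"  # assumed workstation address for this example
for candidate in ["10.30.0.5", "10.20.0.5", "10.10.0.5"]:
    full = common_prefix_len(client, candidate, 128)      # whole mapped address
    truncated = common_prefix_len(client, candidate, 32)  # assumed buggy length
    print(candidate, full, truncated)
```

In this sketch the full comparison gives the 10.30 address the longest match, while the truncated comparison only ever sees the constant leading bits of the mapped form, so all three candidates tie and the local subnet gets no preference from this rule.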

Here is a temporary solution (it lasts until reboot) from another article:

netsh int ipv6 set locality state=disabled

Source: DNS Subnet Prioritization does not work on Windows 10

-1

Try: NETSH winsock reset catalog, then restart to be sure.

fox
  • While this might work, it would be better if you could include an explanation of what this command is supposed to do and why the user should run it. – Jenny D Apr 06 '16 at 14:46
-2

OK, multi-homing has nothing to do with this, as I get the same issue with some of my Win 10 clients too. I've waded through and tried every solution I can find or think of - the DNS server is set up with no round-robin and DNS subnet prio on, I've tweaked the registry settings on both the servers and the hosts to push DNS subnet priority to on (set to the correct mask length, though we're on a /24 so it should default correctly anyway). It's actually pretty commonplace, as a quick google for the issue will show, and is not related to funky network setups - you can get the same thing on any network complex enough for subnet prioritization, but it's only some clients.

The issue is definitely, 100% a client-side problem. We only find the issue on Win 10 boxes (we had similar things with Win 8 at one stage, but that was cleared up with the reg hacks - Win 10 just laughs at them). The DNS queries are returning the correct answers (I've confirmed it with Wireshark too) and the clients will point to the correct address for 4-5 seconds after acquiring an IP, before changing to the incorrect subnet response.

Presently, I'm getting round it by hacking the hosts file on individual clients; after a reboot it'll always point correctly. I'm lucky enough to only have 250-ish machines, of which only around 15 are on Win 10 so far, of which only about half are affected, so it's presently manageable. But yeah, for larger deployments or mobile users it's not really a workable fix.

But yeah, just wanted to let you know that it's not just you, it's not due to your unusual network setup and it's not worth looking at the DNS for it, as DNS clearly isn't causing it.