I have a two-node guest cluster running Windows Server 2016, hosted on Hyper-V Server 2016. The guest cluster is very unreliable: one of the nodes constantly gets quarantined (multiple times per day).
I also have a Windows Server 2012 R2 guest cluster hosted on the same Hyper-V hosts, and it has no issues whatsoever. That means the 2012 R2 and 2016 clusters sit on the same networking and Hyper-V infrastructure.
Further configuration of the 2016 cluster nodes:
- In Network Connections, TCP/IPv6 is unchecked on all adapters. I'm aware this doesn't actually disable IPv6 for the cluster: the hidden NetFT adapter still uses IPv6 and tunnels the heartbeats over IPv4. The healthy 2012 R2 nodes have exactly the same configuration.
- Although the 2012 R2 cluster worked the way I wanted without a witness, I initially configured the 2016 cluster the same way. While troubleshooting, I added a File Share Witness to the 2016 cluster (current network and quorum configuration shown in the snippet after this list) - no change.
- The cluster network validation report completes successfully.
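For reference, this is how I dump the network role and quorum configuration on the 2016 cluster (standard FailoverClusters cmdlets, nothing cluster-specific hard-coded; output omitted):

```powershell
# Cluster networks and the role each one plays (cluster-only vs. cluster-and-client traffic)
Get-ClusterNetwork | Format-Table Name, Role, Address, AddressMask, State -AutoSize

# Which NIC on each node is mapped to which cluster network
Get-ClusterNetworkInterface | Format-Table Node, Network, Name, State -AutoSize

# Current quorum configuration (should show the File Share Witness)
Get-ClusterQuorum | Format-List *
```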
I know WHAT happens, but don't know WHY. The WHAT:
- The cluster plays ping-pong with heartbeat UDP packets on port 3343, over multiple interfaces, between both nodes. Packets are sent roughly once per second.
- Suddenly one node stops playing ping-pong and no longer responds, while the other node keeps trying to deliver heartbeats.
- Reading the cluster log, I found that the node removed its routing information for the peer:
000026d0.000028b0::2019/06/20-10:58:06.832 ERR [CHANNEL fe80::7902:e234:93bd:db76%6:~3343~]/recv: Failed to retrieve the results of overlapped I/O: 10060
000026d0.000028b0::2019/06/20-10:58:06.909 ERR [NODE] Node 1: Connection to Node 2 is broken. Reason (10060)' because of 'channel to remote endpoint fe80::7902:e234:93bd:db76%6:~3343~ has failed with status 10060'
...
000026d0.000028b0::2019/06/20-10:58:06.909 WARN [NODE] Node 1: Initiating reconnect with n2.
000026d0.000028b0::2019/06/20-10:58:06.909 INFO [MQ-...SQL2] Pausing
000026d0.000028b0::2019/06/20-10:58:06.910 INFO [Reconnector-...SQL2] Reconnector from epoch 1 to epoch 2 waited 00.000 so far.
000026d0.00000900::2019/06/20-10:58:08.910 INFO [Reconnector-...SQL2] Reconnector from epoch 1 to epoch 2 waited 02.000 so far.
000026d0.00002210::2019/06/20-10:58:10.910 INFO [Reconnector-...SQL2] Reconnector from epoch 1 to epoch 2 waited 04.000 so far.
000026d0.00002fc0::2019/06/20-10:58:12.910 INFO [Reconnector-...SQL2] Reconnector from epoch 1 to epoch 2 waited 06.000 so far.
...
000026d0.00001c54::2019/06/20-10:59:06.911 INFO [Reconnector-...SQL2] Reconnector from epoch 1 to epoch 2 waited 1:00.000 so far.
000026d0.00001c54::2019/06/20-10:59:06.911 WARN [Reconnector-...SQL2] Timed out, issuing failure report.
...
000026d0.00001aa4::2019/06/20-10:59:06.939 INFO [RouteDb] Cleaning all routes for route (virtual) local fe80::e087:77ce:57b4:e56c:~0~ to remote fe80::7902:e234:93bd:db76:~0~
000026d0.00001aa4::2019/06/20-10:59:06.939 INFO <realLocal>10.250.2.10:~3343~</realLocal>
000026d0.00001aa4::2019/06/20-10:59:06.939 INFO <realRemote>10.250.2.11:~3343~</realRemote>
000026d0.00001aa4::2019/06/20-10:59:06.939 INFO <virtualLocal>fe80::e087:77ce:57b4:e56c:~0~</virtualLocal>
000026d0.00001aa4::2019/06/20-10:59:06.939 INFO <virtualRemote>fe80::7902:e234:93bd:db76:~0~</virtualRemote>
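(The excerpt above is from cluster.log; I regenerate it around each incident with something like the snippet below - the time span and destination folder are arbitrary.)

```powershell
# Regenerate cluster.log on all nodes for the last 10 minutes, using local timestamps,
# and copy the files to C:\Temp (any writable folder works)
Get-ClusterLog -TimeSpan 10 -UseLocalTime -Destination C:\Temp
```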
Now the WHY part... Why does it do that? I don't know. Note that a minute earlier it complains "Failed to retrieve the results of overlapped I/O", but I can still see UDP packets being sent and received until the route is removed at 10:59:06. After that only one node keeps pinging and gets no pongs: in Wireshark, neither 10.0.0.19 nor 10.250.2.10 appears in the source column anymore.
The route is re-added after roughly 35 seconds, but that doesn't help - by then the node has already been quarantined for 3 hours.
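For completeness, these are the heartbeat and quarantine knobs I have been checking; the defaults mentioned in the comments are only what I believe Server 2016 ships with, and the node name in the last command is a placeholder:

```powershell
# Heartbeat cadence/tolerance (*SubnetDelay in ms, *SubnetThreshold in missed heartbeats)
# and quarantine behaviour (QuarantineThreshold, QuarantineDuration in seconds).
# On 2016 the same-subnet delay defaults to 1000 ms as far as I know,
# which matches the ~1 second heartbeat interval seen in Wireshark.
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold, Quarantine*

# Bring a quarantined node back online immediately instead of waiting out QuarantineDuration
# ("Node2" is a placeholder for the quarantined node's name).
Start-ClusterNode -Name "Node2" -ClearQuarantine
```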
What am I missing here?