How to remove stale routes during Windows Cluster RESTART?

Question

BACKGROUND

I have a Windows Cluster (2016) with four nodes (3 NICs each). When I try to restart any of the cluster host server, the whole cluster going down and other nodes are randomly failing.

When I logged a case with Microsoft, they said it is because of the stale routes in NETFT table which is not cleared during the restart and gave me an workaround to restart all nodes to bring up the cluster.

I feel that's going to take long time before I restart my physical servers and bring UP my cluster. I’m having SLA which could breach.

Is there any helpful workaround?

MICROSOFT’s REPLY

From cluster.log, the issue looks related with the stale routes on NetFT.sys.

Log Analysis

(Below errors kept reporting on all 4 cluster nodes, taking one of those occurrences as an example:)

HOST1

2018/09/24-18:25:01.067 INFO  [FTI][Initiator] This node (1) is initiator
2018/09/24-18:25:01.067 WARN  [FTI][Initiator] `Ignoring duplicate connection: usable route already exists`
2018/09/24-18:25:01.067 INFO  [CHANNEL 192.1.0.172:~3343~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.068 WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.172:~3343~ is closed'

HOST2

2018/09/24-18:25:01.095 INFO  [FTI][Initiator] This node (2) is initiator
2018/09/24-18:25:01.095 WARN  [FTI][Initiator] `Ignoring duplicate connection: usable route already exists`
2018/09/24-18:25:01.095 INFO  [CHANNEL 192.1.0.172:~3343~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.096 WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.172:~3343~ is closed'

HOST3

2018/09/24-18:25:01.057 INFO  [FTI][Follower] This node (4) is not the initiator
2018/09/24-18:25:01.057 DBG   [FTI] Stream already exists to node 1: false
2018/09/24-18:25:01.057 DBG   [CHANNEL 192.1.0.170:~62824~] Close().
2018/09/24-18:25:01.057 INFO  [CHANNEL 192.1.0.170:~62824~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.057 INFO  [CORE] Node 4: Clearing cookie [GUID]
2018/09/24-18:25:01.057 DBG   [CHANNEL 192.1.0.170:~62824~] Not closing handle because it is invalid.
2018/09/24-18:25:01.058 WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.170:~62824~ is closed'

HOST4

2018/09/24-18:25:01.087 INFO  [FTI][Initiator] This node (3) is initiator
2018/09/24-18:25:01.087 WARN  [FTI][Initiator] `Ignoring duplicate connection: usable route already exists`
2018/09/24-18:25:01.087 INFO  [CHANNEL 192.1.0.172:~3343~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.088 WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.172:~3343~ is closed'

Those stale routes are the culprit for the nodes to join the cluster and that’s why the node was not able to join back to the cluster.

For NetFT, as the cluster network, any unexpected removed from membership, the NetFT route table is not getting cleared. The connection remained.

When the initiator node tried to create new connection, as the routing table still got the old one, the nodes finally failed to join back to the cluster. The NETFT is a kernel level driver and that’s why we need to reboot the nodes to refresh the NETFT table.

Action Plan

Please try to reboot all cluster nodes at the same time to remove the stale routes.

score 0 · Answer 1 · answered Oct 15 '18 at 17:06

I just experienced this over the weekend on a two node SQL AlwaysOn Cluster. I had to reboot the primary node to get it back. This happened after some networking changes on the network along with Windows Update patching on the same day.

I ran pssdiag to dump the cluster log and saw the exact same entries. Ran it again after the reboot and they were gone.