BACKGROUND
I have a Windows Cluster (2016) with four nodes (3 NICs each). When I restart any of the cluster host servers, the whole cluster goes down and the other nodes fail at random.
When I logged a case with Microsoft, they said it is caused by stale routes in the NetFT table that are not cleared during the restart, and they gave me a workaround: restart all nodes to bring the cluster back up.
I feel that is going to take a long time, since I have to restart all my physical servers before the cluster comes back up. I have an SLA that could be breached.
Is there any helpful workaround?
MICROSOFT’S REPLY
From cluster.log, the issue looks related to the stale routes in NetFT.sys.
Log Analysis
(The errors below were reported repeatedly on all 4 cluster nodes; one occurrence from each node is shown as an example:)
HOST1
2018/09/24-18:25:01.067 INFO [FTI][Initiator] This node (1) is initiator
2018/09/24-18:25:01.067 WARN [FTI][Initiator] `Ignoring duplicate connection: usable route already exists`
2018/09/24-18:25:01.067 INFO [CHANNEL 192.1.0.172:~3343~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.068 WARN cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.172:~3343~ is closed'
HOST2
2018/09/24-18:25:01.095 INFO [FTI][Initiator] This node (2) is initiator
2018/09/24-18:25:01.095 WARN [FTI][Initiator] `Ignoring duplicate connection: usable route already exists`
2018/09/24-18:25:01.095 INFO [CHANNEL 192.1.0.172:~3343~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.096 WARN cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.172:~3343~ is closed'
HOST3
2018/09/24-18:25:01.057 INFO [FTI][Follower] This node (4) is not the initiator
2018/09/24-18:25:01.057 DBG [FTI] Stream already exists to node 1: false
2018/09/24-18:25:01.057 DBG [CHANNEL 192.1.0.170:~62824~] Close().
2018/09/24-18:25:01.057 INFO [CHANNEL 192.1.0.170:~62824~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.057 INFO [CORE] Node 4: Clearing cookie [GUID]
2018/09/24-18:25:01.057 DBG [CHANNEL 192.1.0.170:~62824~] Not closing handle because it is invalid.
2018/09/24-18:25:01.058 WARN mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.170:~62824~ is closed'
HOST4
2018/09/24-18:25:01.087 INFO [FTI][Initiator] This node (3) is initiator
2018/09/24-18:25:01.087 WARN [FTI][Initiator] `Ignoring duplicate connection: usable route already exists`
2018/09/24-18:25:01.087 INFO [CHANNEL 192.1.0.172:~3343~] graceful close, status (of previous failure, may not indicate problem) (0)
2018/09/24-18:25:01.088 WARN cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.1.0.172:~3343~ is closed'
Those stale routes are what prevents the nodes from joining the cluster, which is why the node was not able to join back.
NetFT acts as the cluster network. When a node is removed from membership unexpectedly, the NetFT route table is not cleared and the old connection remains.
When the initiator node tries to create a new connection, the routing table still holds the old one, so the node ultimately fails to join back to the cluster. NetFT is a kernel-level driver, which is why the nodes need to be rebooted to refresh the NetFT table.
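To check whether a given node shows the same signature, the log pattern quoted above can be counted with a short script. This is only a minimal sketch: the function name `summarize_stale_routes` is made up for illustration, and it assumes the warning text and the `[CHANNEL ip:~port~]` format appear exactly as in the excerpts above. It does not confirm the root cause; it only reports how often the pattern occurs.

```python
import re
from collections import Counter

# Warning text copied from the cluster.log excerpts above (assumed to be stable).
STALE_ROUTE_MARKER = "Ignoring duplicate connection: usable route already exists"
# Matches lines such as "[CHANNEL 192.1.0.172:~3343~] graceful close".
CHANNEL_CLOSE = re.compile(
    r"\[CHANNEL (?P<ip>[\d.]+):~(?P<port>\d+)~\] graceful close"
)


def summarize_stale_routes(lines):
    """Count stale-route warnings and the endpoints closed right after them.

    Hypothetical helper: returns (warning_count, Counter of "ip:port").
    """
    warnings = 0
    endpoints = Counter()
    pending = False  # True when the previous line was a stale-route warning
    for line in lines:
        if STALE_ROUTE_MARKER in line:
            warnings += 1
            pending = True
            continue
        match = CHANNEL_CLOSE.search(line)
        if match and pending:
            endpoints[f"{match['ip']}:{match['port']}"] += 1
        pending = False
    return warnings, endpoints


# Sample input: the HOST1 lines from the log analysis above.
sample = [
    "2018/09/24-18:25:01.067 INFO [FTI][Initiator] This node (1) is initiator",
    "2018/09/24-18:25:01.067 WARN [FTI][Initiator] `Ignoring duplicate connection: usable route already exists`",
    "2018/09/24-18:25:01.067 INFO [CHANNEL 192.1.0.172:~3343~] graceful close, status (of previous failure, may not indicate problem) (0)",
]
count, closed = summarize_stale_routes(sample)
print(count, dict(closed))  # prints: 1 {'192.1.0.172:3343': 1}
```

To run it against a real node, pass the lines of a log generated with the `Get-ClusterLog` PowerShell cmdlet instead of `sample`. A count that keeps growing for the same endpoint would be consistent with the stale-route explanation above.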
Action Plan
Please try to reboot all cluster nodes at the same time to remove the stale routes.