I've got an on-premises Service Fabric cluster consisting of 18 nodes (9 of which are seed nodes), secured via gMSA Windows security. Cluster code version: 6.4.622.9590.
Unfortunately I have to rebuild 6 of these nodes (3 of them seed nodes). They all live in one data center (the cluster spans 3 DCs). As such, I wish to remove these 6 nodes, rebuild them, and then re-add them.
As per the Microsoft docs, adding and removing nodes is performed via config upgrades. Note: I recently used this process to add 12 nodes, so I understand the concept of SF config upgrades well.
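For reference, the removal flow I'm attempting follows the standalone-cluster docs, roughly as sketched below (the endpoint, file paths and node names are placeholders for my environment):

```powershell
# Connect using Windows (gMSA) security.
Connect-ServiceFabricCluster -ConnectionEndpoint "sf-node01.contoso.local:19000" -WindowsCredential

# 1. Export the latest cluster configuration JSON.
Get-ServiceFabricClusterConfiguration | Out-File "C:\Temp\ClusterConfig.json" -Encoding utf8

# 2. Edit ClusterConfig.json:
#    - bump "clusterConfigurationVersion"
#    - delete the 6 node entries from the "nodes" array
#    - add a "NodesToBeRemoved" parameter to the "Setup" section of fabricSettings,
#      e.g. { "name": "NodesToBeRemoved", "value": "DC1Node1, DC1Node2, ..." }

# 3. Kick off the config upgrade.
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "C:\Temp\ClusterConfig.json"
```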
Unfortunately, I'm unable to do ANY config upgrades on this cluster until I remove the nodes, due to ValidationExceptions reported by the Start-ServiceFabricClusterConfigurationUpgrade PowerShell command:
- If I don't add the 6 nodes to the "NodesToBeRemoved" section, I get a validation error saying that not all removed nodes are listed in this field.
- If I do add the 6 nodes, I get the following validation error:
```
Start-ServiceFabricClusterConfigurationUpgrade :
System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same
upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa...gurationUpgrade], FabricException
    + FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterConfigurationUpgrade
```
So, we're stuck! I've also already removed the node state for each of these nodes, leaving all 6 in the "Invalid" state. Get-ServiceFabricClusterConfiguration no longer returns these 6 nodes, but they are still shown in SF Explorer and listed in the cluster manifest XML file.
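To illustrate where things currently stand (node names are placeholders):

```powershell
# Node state has already been removed for each of the 6 (down) nodes, e.g.:
Remove-ServiceFabricNodeState -NodeName "DC1Node1" -Force

# Those nodes now report NodeStatus = Invalid here and in SF Explorer:
Get-ServiceFabricNode | Select-Object NodeName, NodeStatus, IsSeedNode

# ...yet they are absent from the exported JSON config while still present in the manifest:
Get-ServiceFabricClusterConfiguration | Out-File "C:\Temp\ClusterConfig.json"   -Encoding utf8
Get-ServiceFabricClusterManifest      | Out-File "C:\Temp\ClusterManifest.xml" -Encoding utf8
```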
As far as the reliability level is concerned, I'm pretty sure one can no longer change this in SF; older versions let you configure Bronze/Silver/Gold in the config file, but in recent versions (6.0+?) it is a calculated field managed internally by SF. In any case, because the seed nodes will drop from 9 to 6, I suspect the internally calculated reliability level will drop too (presumably from Gold to Silver).
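For what it's worth, the seed-node count that this calculated level is derived from can be checked with something like:

```powershell
# Count the nodes the cluster currently treats as seed nodes.
(Get-ServiceFabricNode | Where-Object { $_.IsSeedNode }).Count
```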
I've also come across a hack that someone has used to remove nodes from a cluster, but in my scenario the nodes are still listed in the manifest file. Nonetheless, the words "hack" and "production" should never meet!
So, how do I get our production cluster out of this situation? Rebuilding the cluster is not an option (that's the whole reason for clusters...high availability!).