I've got an on-premises Service Fabric cluster consisting of 18 nodes (9 of which are seed nodes), secured via gMSA Windows security. Cluster code version: 6.4.622.9590.
Unfortunately I have to rebuild 6 of these nodes (3 of them seed nodes). They all live in one data center (the cluster spans 3 DCs). As such, I wish to remove these 6 nodes, rebuild them, and then re-add them.
As per the Microsoft docs, adding and removing nodes is performed via config upgrades. Note: I recently used this process to add 12 nodes, so I understand the concept of SF config upgrades well.
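For reference, the removal flow I'm attempting follows the standalone-cluster docs, roughly as sketched below (the endpoint, file paths and node names are placeholders for my environment):

```powershell
# Connect using Windows (gMSA) security.
Connect-ServiceFabricCluster -ConnectionEndpoint "sf-node01.contoso.local:19000" -WindowsCredential

# 1. Export the latest cluster configuration JSON.
Get-ServiceFabricClusterConfiguration | Out-File "C:\Temp\ClusterConfig.json" -Encoding utf8

# 2. Edit ClusterConfig.json:
#    - bump "clusterConfigurationVersion"
#    - delete the 6 node entries from the "nodes" array
#    - add a "NodesToBeRemoved" parameter to the "Setup" section of fabricSettings,
#      e.g. { "name": "NodesToBeRemoved", "value": "DC1Node1, DC1Node2, ..." }

# 3. Kick off the config upgrade.
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "C:\Temp\ClusterConfig.json"
```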
Unfortunately, I'm unable to do ANY config upgrades on this cluster until I remove the nodes, due to ValidationExceptions reported by the Start-ServiceFabricClusterConfigurationUpgrade PowerShell command:
- If I don't add the 6 nodes to the "NodesToBeRemoved" section, I get a validation error saying that not all removed nodes are listed in this field.
- If I do add the 6 nodes, I get the following validation error:
```
Start-ServiceFabricClusterConfigurationUpgrade :
System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same
upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa...gurationUpgrade], FabricException
    + FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterConfigurationUpgrade
```
So, we're stuck! I've also already removed the node state for each of these nodes, leaving all 6 in the "Invalid" state. Get-ServiceFabricClusterConfiguration no longer returns these 6 nodes, but they are still shown in SF Explorer and listed in the cluster manifest XML file.
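To illustrate where things currently stand (node names are placeholders):

```powershell
# Node state has already been removed for each of the 6 (down) nodes, e.g.:
Remove-ServiceFabricNodeState -NodeName "DC1Node1" -Force

# Those nodes now report NodeStatus = Invalid here and in SF Explorer:
Get-ServiceFabricNode | Select-Object NodeName, NodeStatus, IsSeedNode

# ...yet they are absent from the exported JSON config while still present in the manifest:
Get-ServiceFabricClusterConfiguration | Out-File "C:\Temp\ClusterConfig.json"   -Encoding utf8
Get-ServiceFabricClusterManifest      | Out-File "C:\Temp\ClusterManifest.xml" -Encoding utf8
```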
As far as the reliability level is concerned, I'm pretty sure one can no longer change this in SF; older versions let you configure Bronze/Silver/Gold in the config file, but in recent versions (6.0+?) it is a calculated field managed internally by SF. In any case, because the seed nodes will drop from 9 to 6, I suspect the internally calculated reliability level will drop too (presumably from Gold to Silver).
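For what it's worth, the seed-node count that this calculated level is derived from can be checked with something like:

```powershell
# Count the nodes the cluster currently treats as seed nodes.
(Get-ServiceFabricNode | Where-Object { $_.IsSeedNode }).Count
```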
I've also come across a hack that someone has used to remove nodes from a cluster, but in my scenario the nodes are still listed in the manifest file. Nonetheless, the words "hack" and "production" should never meet!
So, how do I get our production cluster out of this situation? Rebuilding the cluster is not an option (that's the whole reason for clusters...high availability!).