We recently needed to add the Microsoft.Powershell.DSC extension to the VMSS that hosts our Service Fabric cluster. We redeployed the cluster using our ARM template, with the new DSC extension added. During the deployment we observed that as many as 4 of the 5 scale set instances were restarting at the same time, and the services in our cluster were unresponsive while that was happening. The outage lasted only a few minutes, but this seems like something that should not happen.

  • Reliability Level: Silver
  • Durability Level: Bronze
  • For clarification, and so others can learn from this: does this mean that state was lost, or did you only experience an outage? – Poul K. Sørensen Aug 07 '17 at 08:33
  • We experienced an outage. We have both stateless and stateful apps. I was testing the stateless app while the update was applying. I don't believe any state was lost during the upgrade. – Jeff Bailey Aug 07 '17 at 12:22

2 Answers

I suggest reading this article; it's a Microsoft employee's blog. I'll quote the relevant part:

If you don’t mind all your VMs being rebooted at the same time, you can set upgradePolicy to “Automatic”. Otherwise set it to “Manual” and take care of applying changes to the scale set model to individual VMs yourself. It is fairly easy to script rolling out the update to VMs while maintaining application uptime. See https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set for more details.

If your scale set is in a Service Fabric cluster, certain updates like changing OS version are blocked (currently – that will change in future), and it is recommended that upgradePolicy be set to “Automatic”, as Service Fabric takes care of safely applying model changes (like updated extension settings) while maintaining availability.
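For the "Manual" option in the quoted passage, the rolling script could look roughly like the sketch below. It only shows the batching logic. `apply_model` and `wait_until_healthy` are hypothetical callables that you would have to supply yourself (for example, a wrapper around the Azure CLI's `az vmss update-instances` and a health check of your choosing); they are not part of any Azure SDK. As the quote says, for a Service Fabric scale set the recommendation is "Automatic", so treat this as an illustration of the idea rather than advice for this cluster.

```python
from typing import Callable, Iterator, List, Sequence


def plan_batches(instance_ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size instance IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instance_ids), batch_size):
        yield list(instance_ids[start:start + batch_size])


def rolling_update(
    instance_ids: Sequence[str],
    batch_size: int,
    apply_model: Callable[[List[str]], None],
    wait_until_healthy: Callable[[List[str]], None],
) -> None:
    """Update one batch at a time, waiting for health before moving on."""
    for batch in plan_batches(instance_ids, batch_size):
        # Hypothetical hook: push the latest scale set model to these VMs.
        apply_model(batch)
        # Hypothetical hook: block until the batch is back in service.
        wait_until_healthy(batch)
```

With five instances and `batch_size=1`, at most one VM is being updated at any moment, which is the property the 4-out-of-5 restart in the question did not have.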

4c74356b41
  • This is a Service Fabric cluster VMSS I have and it is letting most of the VMs reboot at the same time. Likely due to the durability level of bronze as @LoekD suggested? – Jeff Bailey Aug 03 '17 at 20:44

This is caused by the selected durability level 'bronze'.

The durability tier is used to indicate to the system the privileges that your VMs have with the underlying Azure infrastructure. In the primary node type, this privilege allows Service Fabric to pause any VM level infrastructure request (such as a VM reboot, VM reimage, or VM migration) that impact the quorum requirements for the system services and your stateful services. In the non-primary node types, this privilege allows Service Fabric to pause any VM level infrastructure requests like VM reboot, VM reimage, VM migration etc., that impact the quorum requirements for your stateful services running in it.

Bronze - No privileges. This is the default and is recommended if you are only running stateless workloads in your cluster.

LoekD
  • I suspect you are right here, however the service I was testing was a stateless service. Do you happen to know why Bronze would be OK for stateless services? If the VMSS reboots all of the machines wouldn't the stateless services become unresponsive? – Jeff Bailey Aug 03 '17 at 20:46
  • Yes, stateless services can keep running even if only one node is up (e.g. a secondary node). However, the system services cannot. – LoekD Aug 04 '17 at 08:39
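
To make the quorum point in the last comment concrete, here is a small arithmetic sketch. It assumes that the Silver reliability level means the system services target 5 replicas, and that those replicas are spread one per node across the 5 instances in the question; the function names are made up for illustration.

```python
def quorum_size(replica_count: int) -> int:
    """Majority quorum: more than half of the replicas."""
    return replica_count // 2 + 1


def has_quorum(replica_count: int, replicas_down: int) -> bool:
    """True if the replicas still up form a majority."""
    return replica_count - replicas_down >= quorum_size(replica_count)


# Assumption: Silver reliability -> 5 system service replicas,
# one per node on the 5-instance scale set from the question.
SYSTEM_REPLICAS = 5

print(quorum_size(SYSTEM_REPLICAS))    # 3
print(has_quorum(SYSTEM_REPLICAS, 1))  # True
print(has_quorum(SYSTEM_REPLICAS, 4))  # False
```

Under those assumptions, one restarting node leaves the system services with a majority, while 4 of 5 restarting at once does not, which would match the few minutes of unresponsiveness described in the question.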