Long Running Tasks in Service Fabric and Scaling Cluster In

Question

We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.

When it's busy we want to scale out servers, and when it gets quiet we want to scale back in again.

How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?

As far as I know, SF doesn't scale automatically and you have to update your cluster to scale out by adding more VMs. — Sean Feldman, Aug 08 '17 at 22:17
That's what I meant. I'm just not sure there's something that would add/remove VMs to/from VMSS based on a status of a service. Saying that, if you have a stateless service OOTB. — Sean Feldman, Aug 08 '17 at 22:45
Ah right :) What I was hoping is that there is some way you could select which node to shutdown. So right now VMSS would just scale down one node even if work is still happening on it? (Which is what I thought/was worried about) — tank104, Aug 08 '17 at 23:40
One thing I thought of - is it possible to scale down the node which has the lowest CPU usage? — tank104, Aug 09 '17 at 21:59
I see two options if you want to scale (automatically): using Azure Functions or Azure Batch. Azure Functions does not allow long running tasks, but you could try splitting up your code into smaller tasks. Otherwise, Azure Batch is designed to perform this kind of long running task and can scale automatically — Thomas, Aug 10 '17 at 03:50

Kiryl · Accepted Answer · 2017-08-14T07:32:24.577


Azure Monitor Custom Metric
- Integrate your SF service with EventFlow, for instance to have it send logs to Application Insights.
- While a task is being processed, emit logs indicating that it is in progress.
- Configure a custom metric in Azure Monitor so that scale-in happens only when there are no logs indicating that a machine has in-progress tasks.

The trade-off is that scale-in has to wait until all in-progress tasks have finished.
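
As a rough illustration of that rule, the scale-in gate can be modelled as a check over the time each node last logged an in-progress task. This Python sketch only models the decision; it is not Azure Monitor configuration. The function name, the node names, and the 10-minute quiet period are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def scale_in_allowed(last_heartbeats, now, quiet_period=timedelta(minutes=10)):
    """Return True only if no node logged an in-progress task within quiet_period.

    last_heartbeats maps a node name to the time of its most recent
    "task in progress" log entry, or None if it never sent one.
    """
    return all(
        seen is None or now - seen >= quiet_period
        for seen in last_heartbeats.values()
    )


# Hypothetical snapshot: one node has been quiet, the other is still working.
now = datetime(2017, 8, 9, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "_Worker_0": now - timedelta(minutes=45),  # idle for a while
    "_Worker_1": now - timedelta(minutes=2),   # still processing
}
print(scale_in_allowed(heartbeats, now))  # False
```

Because the check spans every node, a single busy machine blocks scale-in for the whole cluster, which is the trade-off described above.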

There is a good article that explains how to Scale a Service Fabric cluster programmatically.
Here is another approach, which requires a bit of coding:
Automate manual scaling
- Develop another service, either as part of the SF application or as a VM extension. The point is to have this service running on every node in the cluster, tracking the status of task execution.
- There are well-defined steps for manually removing an SF node from the cluster:
  - Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
  - Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
  - Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that node type. The instance removed is the highest VM instance.
  - And so forth. Find more info in Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway is that these steps can be automated.

Implement the scaling logic in the new service: have it monitor which nodes have finished their tasks and are sitting idle, then scale those nodes in using the steps described above.
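
The selection part of that logic might look something like the sketch below. It is written in Python only to show the decision, not the Service Fabric calls themselves. It assumes node names end in `_<instance number>` and that the monitoring service can report a busy/idle flag per node; `pick_scale_in_candidate`, `min_nodes`, and the sample node names are hypothetical. Since the documented steps always remove the highest VM instance, only that node is considered.

```python
def instance_index(node_name):
    """Parse the trailing instance number, e.g. "_Worker_3" -> 3 (assumed naming)."""
    return int(node_name.rsplit("_", 1)[1])


def pick_scale_in_candidate(busy_by_node, min_nodes):
    """Return the node to disable, or None if scale-in has to wait.

    busy_by_node maps a node name to True while that node is processing a task.
    """
    if len(busy_by_node) <= min_nodes:
        return None  # never shrink below the configured floor
    highest = max(busy_by_node, key=instance_index)
    return None if busy_by_node[highest] else highest


# Order of operations taken from the steps above and the follow-up comments.
SCALE_IN_STEPS = (
    "Disable-ServiceFabricNode with intent RemoveNode",
    "Get-ServiceFabricNode until the node reports disabled",
    "reduce the VM count of the node type by one",
    "Remove-ServiceFabricNodeState as the last command",
)

nodes = {"_Worker_0": False, "_Worker_1": False, "_Worker_2": True}
print(pick_scale_in_candidate(nodes, min_nodes=1))  # None

nodes["_Worker_2"] = False
print(pick_scale_in_candidate(nodes, min_nodes=1))  # _Worker_2
```

In this sketch a busy highest instance delays scale-in even when lower instances are idle. That is the cost of following the documented removal order rather than deleting a specific VM.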

Hopefully it makes sense.

Thanks a lot to @tank104 for the help on elaborating my answer!

edited Aug 14 '17 at 07:32

answered Aug 09 '17 at 10:33

Kiryl


That does - but it would mean I could only scale down when all events had finished - as I wouldn't be able to know which node is sending or not sending events to scale down? – tank104 Aug 09 '17 at 21:59
Truly should be a function of SF itself. Just like Cloud Services could scale out/in. – Sean Feldman Aug 09 '17 at 22:03
@tank104 - yeah, your node won't get shut down while a precious task is running, but you will have to wait for all events to finish. I've got another idea; I'll update the answer soon. – Kiryl Aug 10 '17 at 08:10
Thanks - I am going to try your suggestion now and will report back – tank104 Aug 10 '17 at 23:03
@tank104 Looking forward to your reply – Kiryl Aug 11 '17 at 06:31
OK, lots of playing around and not much code to show for it, but where I am right now is that each node (Stateless Service) checks where the primary seed is, and if it's not that node and it isn't processing anything, it calls: `using (var client = new FabricClient()) { await client.ClusterManager.DeactivateNodeAsync(nodeName, NodeDeactivationIntent.RemoveNode); }` This doesn't seem to do much though. – tank104 Aug 11 '17 at 06:59
Next I am trying this: `var scaleSet = AzureClient.VirtualMachineScaleSets.GetById(vmssResourceID); long scaleSetCapcity = scaleSet.Capacity; var virtualMachine = scaleSet.VirtualMachines.List().FirstOrDefault(v => v.Name == "XXXX"); virtualMachine.Delete();` – tank104 Aug 11 '17 at 07:03
@tank104 "When you scale a cluster down you will see the removed node/VM instance displayed in an unhealthy state unless you call Remove-ServiceFabricNodeState cmd with the appropriate node name." Try doing this with just PowerShell first, to make sure that it works, and then move it into code. – Kiryl Aug 11 '17 at 07:30
It seems to be ok - perhaps because I have Silver durability/reliability so it does that? Should I call RemoveNodeStateAsync before or after I call DeactivateNodeAsync? – tank104 Aug 11 '17 at 09:22
Now that you mention it - I did see the error state briefly - so perhaps that is why. Do you know which order I should call it in? Ahhh, actually I see the call now here https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-programmatic-scaling so it should be at the end? I didn't notice that last section. – tank104 Aug 11 '17 at 09:25
@tank104 That's my understanding - it needs to be the last command. – Kiryl Aug 11 '17 at 12:18
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/151726/discussion-between-kiryl-z-and-tank104). – Kiryl Aug 11 '17 at 13:09
I am going to mark this as answered - see my comments for extra information on changes I did. Thanks Kiryl – tank104 Aug 14 '17 at 06:11
