
We have been using low priority nodes at my company for a long time. Every now and then nodes get preempted, but our tasks always end up running eventually.

We have 1 dedicated node, and we scale up to 20 low priority ones. For the last 3 days, no low priority node has been created when scaling. The pool shows that it is trying to set up more nodes, but it just stays in that state.

Is there any solution for this? Are low priority nodes broken at the moment?

2 Answers


From "Use low-priority VMs with Batch":

"The tradeoff for using low-priority VMs is that those VMs may not be available to be allocated or may be preempted at any time, depending on available capacity."

Expecting a low priority instance to be available at any given time is unreasonable. The excess capacity may simply not be there, and it may stay that way indefinitely.

First, check the status dashboard and/or social media. I don't see any reported problems with Batch.

Shop around for different instance sizes in different regions. Sometimes the lack of surplus capacity for your favorite size is local to one region.

Add full-price (dedicated) instances for the work that must get done.
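To make that last suggestion concrete, here is a minimal sketch of the fallback logic in plain Python. It is not Batch API code: the function name, its parameters, and the idea of passing in an observed count of obtainable low priority nodes are all assumptions made for illustration. The defaults of 1 dedicated node and 20 low priority nodes come from the question.

```python
def plan_node_targets(required_nodes, low_priority_available,
                      max_low_priority=20, min_dedicated=1):
    """Split a required node count between low priority and dedicated nodes.

    Hypothetical helper: `low_priority_available` is whatever number of low
    priority nodes you have observed you can actually get right now.
    """
    if required_nodes < 0 or low_priority_available < 0:
        raise ValueError("node counts must be non-negative")

    # The dedicated baseline always runs, so low priority covers the rest.
    remaining = max(0, required_nodes - min_dedicated)
    low_priority = min(remaining, low_priority_available, max_low_priority)

    # Whatever low priority cannot cover falls back to full-price nodes.
    dedicated = max(min_dedicated, required_nodes - low_priority)
    return {"dedicated": dedicated, "low_priority": low_priority}


if __name__ == "__main__":
    # No low priority capacity at all: everything runs on dedicated nodes.
    print(plan_node_targets(required_nodes=10, low_priority_available=0))
```

The resulting numbers would then be applied as the pool's dedicated and low priority targets by whatever means you already use to resize the pool.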

John Mahowald
  • Low priority nodes tend to be available at some point during the day. They are definitely never unavailable for 3 days in a row. We've been using them for 2 years now. The problem was that some VM objects/resources were deleted; this broke low priority nodes but not dedicated ones. – Renato Fontes May 13 '19 at 17:12

The problem was quite complicated, and probably a bug in Azure Batch.

Some VM objects had been removed from the resource group by someone on the team. This left low priority VMs unable to start; the weird part is that dedicated VMs still started correctly.

I solved it by using the VM image to create a new VM, then creating a new image from that VM, and recreating the Azure Batch pool with this new image. The important part was to leave the VM objects in place rather than deleting them.

  • One more thing, in case you are trying to figure something out in Batch: Batch is buggy, especially when using Linux VMs. Ubuntu 18.04 was completely broken for Batch at least a few months ago (I'm using 16.04). – Renato Fontes May 13 '19 at 17:14