Does anyone have advice on how to reimage Linux nodes in my Azure Batch pool without resizing to 0 and then back to N, or deleting the pool and creating it again?
Or is that the recommended best practice?
More details:
I'm having issues with re-imaging an Azure Batch node. When I update the Docker image and redeploy using the ARM template, the node does not pull the latest Docker image. I think this may be because the image name is the same (I always want the latest image).
I've tried using:
- Reset-AzureBatchComputeNode, but that gives me the following error in Fiddler: "Operation reimage can be invoked only on pools created with cloudServiceConfiguration". I can't use cloudServiceConfiguration because the machine needs to be a Linux machine.
- Restart-AzureBatchComputeNode, but that only restarts the node rather than re-imaging it.
I might just have to nuke the nodes (resize to 0, then spin up as many as I need again), or simply delete the pool and then set it up again. But these seem like "nuclear" options and the batch service would be down until the nodes have been spun up again.
The ARM template I use to deploy/update the Batch pool:
{
  "name": "[concat(parameters('batchAccountName'), '/<pool-name>')]",
  "type": "Microsoft.Batch/batchAccounts/pools",
  "apiVersion": "2018-12-01",
  "properties": {
    "vmSize": "[parameters('vmSize')]",
    "deploymentConfiguration": {
      "virtualMachineConfiguration": {
        "nodeAgentSkuId": "batch.node.ubuntu 16.04",
        "imageReference": {
          "publisher": "microsoft-azure-batch",
          "offer": "ubuntu-server-container",
          "sku": "16-04-lts",
          "version": "latest"
        },
        "containerConfiguration": {
          "type": "DockerCompatible",
          "containerImageNames": [
            "[concat(parameters('containerRegistryServer'), '/<container-name>')]"
          ],
          "containerRegistries": [
            <credentials>
          ]
        }
      }
    },
    "scaleSettings": {
      "fixedScale": {
        "targetDedicatedNodes": "[parameters('targetDedicatedNodes')]"
      }
    }
  },
  "dependsOn": [
    "[resourceId('Microsoft.Batch/batchAccounts', parameters('batchAccountName'))]"
  ]
},
--
UPDATE:
Thanks @fpark, with your advice I came up with the following PowerShell script, in case anyone else runs into the same issue:
Write-Output "Building docker image"
$imageHashBeforeBuild = docker images $DockerImageName --format "{{.ID}}" --no-trunc
docker build -t $DockerImageName $pathToEnergyModel
if (!$?) {
    throw "Docker image $DockerImageName failed to build"
}
$imageHashAfterBuild = docker images $DockerImageName --format "{{.ID}}" --no-trunc
...
$batchContext = Get-AzureRmBatchAccount -Name $batchAccountName
...
# The nodes should only be reimaged if the model has an update and this is NOT a new deployment
$ShouldReimageNodes = $IsUpdate -and $imageHashBeforeBuild -and ($imageHashBeforeBuild -ne $imageHashAfterBuild)
# The batchAccountDeployment step will create/update Batch accounts/pools.
# However, the deployment does not update the nodes to the latest image present in the Docker container registry.
# This is likely because the ARM template has the same settings, so it doesn't know to try to pull the image down again.
# As a workaround:
#   1) Grab all current nodes
#   2) For each node:
#      a) Bring it down (this has the side effect of reducing TargetDedicatedComputeNodes by 1)
#      b) Resize TargetDedicatedComputeNodes back to the correct value (i.e. spin up a node to replace the one downed in 2a)
# When the VMs come back up, they do pull the latest Docker image
if ($ShouldReimageNodes) {
    Write-Host "Difference in docker images detected. Replacing each node one at a time to ensure latest docker image is being used."

    # Wait for nodes to stabilize before touching the pool
    while ((Get-AzureBatchPool -BatchContext $batchContext -Id $PoolName).AllocationState -ne "Steady") {
        Write-Host "Waiting for nodes in $PoolName to stabilize. Checking status again in $SleepTime seconds."
        Start-Sleep -Seconds $SleepTime
    }

    $nodes = Get-AzureBatchComputeNode -PoolId $PoolName -BatchContext $batchContext
    $currentNodeCount = $nodes.Length

    foreach ($node in $nodes) {
        $nodeId = $node.Id

        # Remove the node; this also lowers the pool's target node count by 1
        Write-Host "Removing node $nodeId"
        Remove-AzureBatchComputeNode -ComputeNode $node -BatchContext $batchContext -Force

        while ((Get-AzureBatchPool -BatchContext $batchContext -Id $PoolName).AllocationState -ne "Steady") {
            Write-Host "Waiting for nodes in $PoolName to stabilize. Checking status again in $SleepTime seconds."
            Start-Sleep -Seconds $SleepTime
        }

        # Resize back to the original count so a fresh node replaces the removed one
        Write-Host "Resizing back to $currentNodeCount"
        Start-AzureBatchPoolResize -Id $PoolName -BatchContext $batchContext -TargetDedicatedComputeNodes $currentNodeCount

        while ((Get-AzureBatchPool -BatchContext $batchContext -Id $PoolName).AllocationState -ne "Steady") {
            Write-Host "Waiting for nodes in $PoolName to stabilize. Checking status again in $SleepTime seconds."
            Start-Sleep -Seconds $SleepTime
        }
    }
}
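
The same "wait until Steady" loop appears three times in the script, so it could be pulled into a small helper. This is only a sketch: the function name Wait-BatchPoolSteady and the default of 30 seconds are my own assumptions, not something from the Azure module. It relies only on the Get-AzureBatchPool call and the AllocationState check already used in the script.

```powershell
# Hypothetical helper (name and default sleep time are assumptions):
# polls the pool until its AllocationState reports "Steady".
function Wait-BatchPoolSteady {
    param(
        [Parameter(Mandatory = $true)] $BatchContext,
        [Parameter(Mandatory = $true)] [string] $PoolName,
        [int] $SleepTime = 30
    )

    # Same check as in the script: keep polling while the pool is still resizing
    while ((Get-AzureBatchPool -BatchContext $BatchContext -Id $PoolName).AllocationState -ne "Steady") {
        Write-Host "Waiting for nodes in $PoolName to stabilize. Checking status again in $SleepTime seconds."
        Start-Sleep -Seconds $SleepTime
    }
}
```

Each of the three loops would then become a single call such as `Wait-BatchPoolSteady -BatchContext $batchContext -PoolName $PoolName -SleepTime $SleepTime`.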