
Does anyone have advice on how to reimage Linux nodes in my Azure Batch Pool account without resizing to 0 then back to N, or deleting the pool and creating it again?

Or is that the recommended best practice?

More details:

I'm having trouble re-imaging an Azure Batch node. When I update the Docker image and redeploy using an ARM template, the node does not pull the latest Docker image. I think this may be because the image name stays the same (I always want the latest image).

I've tried using:

Reset-AzureBatchComputeNode, but that gives me the following error in Fiddler: "Operation reimage can be invoked only on pools created with cloudServiceConfiguration". I can't use a cloud service configuration because the nodes need to run Linux.

Restart-AzureBatchComputeNode, but that only restarts the node rather than re-imaging it

I might just have to nuke the nodes (resize to 0, then spin up as many as I need again), or simply delete the pool and then set it up again. But these seem like "nuclear" options and the batch service would be down until the nodes have been spun up again.

The ARM template I use to deploy/update the Batch pool:

{
      "name": "[concat(parameters('batchAccountName'), '/<pool-name>')]",
      "type": "Microsoft.Batch/batchAccounts/pools",
      "apiVersion": "2018-12-01",
      "properties": {
        "vmSize": "[parameters('vmSize')]",
        "deploymentConfiguration": {
          "virtualMachineConfiguration": {
            "nodeAgentSkuId": "batch.node.ubuntu 16.04",
            "imageReference": {
              "publisher": "microsoft-azure-batch",
              "offer": "ubuntu-server-container",
              "sku": "16-04-lts",
              "version": "latest"
            },
            "containerConfiguration": {
              "type": "DockerCompatible",
              "containerImageNames": [
                "[concat(parameters('containerRegistryServer'), '/<container-name>')]"
              ],
              "containerRegistries": [
                <credentials>
              ]
            }
          }
        },
        "scaleSettings": {
          "fixedScale": {
            "targetDedicatedNodes": "[parameters('targetDedicatedNodes')]"
          }
        }
      },
      "dependsOn": [
        "[resourceId('Microsoft.Batch/batchAccounts', parameters('batchAccountName'))]"
      ]
    },

--

UPDATE:

Thanks @fpark, with your advice I came up with the following PowerShell script, in case anyone else needs it:

Write-Output "Building docker image"
$imageHashBeforeBuild = docker images $DockerImageName --format "{{.ID}}" --no-trunc
docker build -t $DockerImageName $pathToEnergyModel
if (!$?) {
    throw "Docker image $DockerImageName failed to build"
}
$imageHashAfterBuild = docker images $DockerImageName --format "{{.ID}}" --no-trunc

...

$batchContext = Get-AzureRmBatchAccount -Name $batchAccountName

...

# The nodes should only be reimaged if the model has an update and this is NOT a new deployment
$ShouldReimageNodes = $IsUpdate -and $imageHashBeforeBuild -and ($imageHashBeforeBuild -ne $imageHashAfterBuild)
# The batchAccountDeployment step will create/update batch accounts/pools.
# However, the deployment does not update the nodes to the latest Docker image present in the container registry.
# This is likely because the ARM template settings are unchanged, so the service doesn't know to pull the image again.
# As a workaround:
#   1) Grab all current nodes
#   2) For each node:
#       a) Remove it (this has the side effect of reducing TargetDedicatedComputeNodes by 1)
#       b) Resize TargetDedicatedComputeNodes back to the correct value (i.e. spin up a node to replace the one removed in 2a)
# When the VMs come back up, they do pull the latest Docker image
if ($ShouldReimageNodes) {
    # Wait for nodes to stabilize
    Write-Host "Difference in Docker images detected. Replacing each node one at a time to ensure the latest Docker image is being used."
    while ((Get-AzureBatchPool -BatchContext $batchContext -Id $PoolName).AllocationState -ne "Steady") {
        Write-Host "Waiting for nodes in $PoolName to stabilize. Checking status again in $SleepTime seconds."
        Start-Sleep -Seconds $SleepTime
    }
    $nodes = Get-AzureBatchComputeNode -PoolId $PoolName -BatchContext $batchContext
    $currentNodeCount = $nodes.Length
    foreach ($node in $nodes) {
        $nodeId = $node.Id
        Write-Host "Removing node $nodeId"
        Remove-AzureBatchComputeNode -ComputeNode $node -BatchContext $batchContext -Force
        while ((Get-AzureBatchPool -BatchContext $batchContext -Id $PoolName).AllocationState -ne "Steady") {
            Write-Host "Waiting for nodes in $PoolName to stabilize. Checking status again in $SleepTime seconds."
            Start-Sleep -Seconds $SleepTime
        }
        Write-Host "Resizing back to $currentNodeCount"
        Start-AzureBatchPoolResize -Id $PoolName -BatchContext $batchContext -TargetDedicatedComputeNodes $currentNodeCount
        while ((Get-AzureBatchPool -BatchContext $batchContext -Id $PoolName).AllocationState -ne "Steady") {
            Write-Host "Waiting for nodes in $PoolName to stabilize. Checking status again in $SleepTime seconds."
            Start-Sleep -Seconds $SleepTime
        }
    }
}
Pranshu
  • Note that `Remove-AzureBatchComputeNode` can take a list of nodes. It's best to batch your removals together and resize once. You'll reach your target state much faster that way. Additionally, you should consider using a start task to always issue a docker pull against your image. That way you can just reboot your nodes instead of this method. There are also other alternatives like running a multi-instance no-op task with job prep issuing docker pull. – fpark Mar 29 '19 at 13:58
  • Thanks @fpark! I like your suggestion regarding a start task. I'll look into it next and update this if that path ends up working for me. However - I didn't want to resize to 0 and back to N because that would mean our job pool would be unavailable for a portion of time. I would prefer that the nodes are available (even if they are out of date) so that jobs can run while the new model is being applied to the nodes – Pranshu Mar 29 '19 at 18:02
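
For reference, the start task idea from the comments could look roughly like the fragment below, placed inside the pool's "properties" next to "vmSize". This is only a sketch under assumptions: it reuses the question's containerRegistryServer parameter and <container-name> placeholder, the retry count is arbitrary, and it assumes the node can already authenticate against the registry when running docker pull on the host (a private registry may need a docker login step first). Because Batch runs the start task again when a node reboots, Restart-AzureBatchComputeNode should then be enough to refresh the image.

```json
"startTask": {
  "commandLine": "[concat('/bin/bash -c \"docker pull ', parameters('containerRegistryServer'), '/<container-name>\"')]",
  "userIdentity": {
    "autoUser": {
      "scope": "Pool",
      "elevationLevel": "Admin"
    }
  },
  "maxTaskRetryCount": 2,
  "waitForSuccess": true
}
```

Admin elevation is used here so the command can talk to the Docker daemon; waitForSuccess keeps tasks off the node until the pull has finished.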

1 Answer


Currently, re-imaging operations on Virtual Machine Configuration based pools are not supported. Please see this UserVoice idea.

You can emulate reimaging a set of nodes by invoking the Remove-AzureBatchComputeNode cmdlet for those nodes, then resizing the pool back to your desired size.
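
A minimal sketch of that remove-then-resize approach, removing several nodes per call rather than one at a time. It assumes $batchContext, $PoolName and $SleepTime are already set as in the question's script. Wait-PoolSteady is a hypothetical helper, not part of the Azure module, and replacing half the pool per pass is an arbitrary choice made so that some nodes stay available while others are replaced.

```powershell
# Hypothetical helper: block until the pool has finished any resize in progress.
function Wait-PoolSteady {
    param($Context, [string]$PoolId, [int]$PollSeconds = 30)
    while ((Get-AzureBatchPool -BatchContext $Context -Id $PoolId).AllocationState -ne "Steady") {
        Start-Sleep -Seconds $PollSeconds
    }
}

# Snapshot the current nodes; @() keeps this an array even for a single node.
$nodes = @(Get-AzureBatchComputeNode -PoolId $PoolName -BatchContext $batchContext)
$targetCount = $nodes.Count

# Assumption: replace roughly half of the pool per pass (at least one node).
$batchSize = [Math]::Max(1, [int][Math]::Floor($targetCount / 2))

for ($i = 0; $i -lt $targetCount; $i += $batchSize) {
    $end = [Math]::Min($i + $batchSize, $targetCount) - 1
    $ids = [string[]]@($nodes[$i..$end] | ForEach-Object { $_.Id })

    Wait-PoolSteady -Context $batchContext -PoolId $PoolName -PollSeconds $SleepTime

    # Removing nodes lowers the pool's target size by the number removed.
    Remove-AzureBatchComputeNode -PoolId $PoolName -Ids $ids -BatchContext $batchContext -Force
    Wait-PoolSteady -Context $batchContext -PoolId $PoolName -PollSeconds $SleepTime

    # One resize per batch brings in fresh nodes, which pull the image again.
    Start-AzureBatchPoolResize -Id $PoolName -TargetDedicatedComputeNodes $targetCount -BatchContext $batchContext
    Wait-PoolSteady -Context $batchContext -PoolId $PoolName -PollSeconds $SleepTime
}
```

If brief downtime is acceptable, passing every node ID in a single Remove-AzureBatchComputeNode call followed by one resize reaches the target state fastest.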

fpark