
I have a Managed Instance Group of Google Compute Engine VMs (based on a template with container deployment on Container-Optimized OS). The MIG is regional (multi-zoned).

I can release an updated container image (docker build, docker tag, docker push), and then I'd like to restart all VMs in the MIG one by one so that they pick up the updated container (I'm not sure if there's a simpler or better alternative to refresh a VM's attached container). But I also want to introduce a slight delay (say 60 seconds) between each VM's restart, so that only one or two VMs are unavailable during their restart.

What are some ways to do this programmatically (either via gcloud CLI or their API)?

I tried a rolling restart of the MIG, with maximum unavailable and minimum wait time flags set:

gcloud beta compute instance-groups managed rolling-action restart MIG_NAME \
    --project="..." --region="..." \
    --max-unavailable=1 --min-ready=60

... but it returns an error:

ERROR: (gcloud.beta.compute.instance-groups.managed.rolling-action.restart) Could not fetch resource:
 - Invalid value for field 'resource.updatePolicy.maxUnavailable.fixed': '1'. Fixed updatePolicy.maxUnavailable for regional managed instance group has to be either 0 or at least equal to the number of zones.
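
For context, per that error the smallest fixed value --max-unavailable will accept for this regional MIG appears to be the number of zones, so the closest accepted variant (the sketch below assumes a three-zone region) still takes down up to one VM per zone at a time rather than one VM overall:

gcloud beta compute instance-groups managed rolling-action restart MIG_NAME \
    --project="..." --region="..." \
    --max-unavailable=3 --min-ready=60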

Is there a way to perform one-by-one instance restarts with a slight delay in between each action?

  • Unfortunately, this feature is not implemented yet for regional deployments. It works correctly for zonal ones. – Grzenio Jan 13 '23 at 07:17
  • Thanks @Grzenio, do you think using `gcloud beta compute instances update-container` iteratively for each instance, with a slight delay (e.g. sleep()) in between each call will be a good workaround? – Nick Jan 13 '23 at 07:22
  • Frankly, I am not able to figure out what `gcloud compute instances update-container` actually does, but let me suggest a semi-manual solution using the MIG api. – Grzenio Jan 13 '23 at 09:35
  • No worries, btw here's the doc on `update-container` command: https://cloud.google.com/compute/docs/containers/deploying-containers#updating_a_container_on_a_vm_instance – Nick Jan 14 '23 at 05:26
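
A minimal sketch of the update-container workaround discussed in the comments above (the instance names, zone, and image are placeholders, and the 60-second pause stands in for the suggested sleep()):

# Update the attached container image on each VM in turn, pausing between updates.
for INSTANCE in instance-1 instance-2 instance-3; do    # placeholder instance names
  gcloud compute instances update-container "$INSTANCE" \
      --zone="..." \
      --container-image="..."
  sleep 60
done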

2 Answers


Unfortunately, MIGs don't handle this use case for regional deployments as of January 2023. You can, however, orchestrate the rolling restart yourself, for example with a shell loop along these lines (a sketch; the health check at the end is a placeholder):

# Restart each instance in the MIG one at a time, waiting between restarts.
for INSTANCE in $(gcloud compute instance-groups managed list-instances MIG_NAME \
    --project="..." --region="..." --format="value(instance.basename())"); do
  # Force a restart of this single instance
  gcloud compute instance-groups managed update-instances MIG_NAME \
      --project="..." --region="..." \
      --instances="$INSTANCE" --minimal-action=restart \
      --most-disruptive-allowed-action=restart

  sleep 60

  # Check that the container on INSTANCE is working correctly (substitute your
  # own health check); break out of the loop and alert the operator if it isn't.
  # if ! container_is_healthy "$INSTANCE"; then break; fi
done
  • Thanks. How do I get the list of instances across multiple MIGs? Also, how to dynamically set the `region` in this case? – Nick Jan 14 '23 at 03:38
  • Strangely, the above `update-instances` command with `restart` actions still replaces the VMs and allots new IPs, instead of just restarting the VMs to keep the same IPs. – Nick Jan 14 '23 at 04:43
  • In GCE "restart" actually means deleting the VM, creating a new one and reprogramming of the networking. Having said that, I would expect the VMs to keep the same IP in the process. Would you be able to ask another question specifically about that (for clarity) and add more details, like the full configuration of the IGM, Instance before and after the restart, etc.? – Grzenio Jan 15 '23 at 10:44
  • From what I've understood reading the docs, "replace" action deletes the VM and creates a new one, "restart" action should just restart/reboot the machine without changing the machine name or IP, as long as the replacement method for it is "replace" instead of "substitute" (as per https://cloud.google.com/compute/docs/instance-groups/rolling-out-updates-to-managed-instance-groups#replacement_method). – Nick Jan 17 '23 at 10:43
  • Yeah, it is misleading. Either way, you are correct that the IP should be preserved. My suspicion is that the instance got auto-repaired for some reason. – Grzenio Jan 18 '23 at 07:54
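
Regarding the comment about multiple MIGs and setting the region dynamically, one possible approach (a sketch; the --filter and --format expressions are assumptions worth verifying against your gcloud version) is to enumerate the regional MIGs first and then list each group's instances:

# List regional MIGs as "NAME REGION" pairs, then list each group's instances by name.
gcloud compute instance-groups managed list \
    --filter="region:*" \
    --format="value(name,region.basename())" |
while read -r MIG REGION; do
  gcloud compute instance-groups managed list-instances "$MIG" \
      --region="$REGION" \
      --format="value(instance.basename())"
done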

Try looking into opportunistic updates instead of rolling updates. We have a similar scenario. A rolling update for a MIG, particularly a stateful one, won't work here because it brings down at least a minimum number of instances (at least the number of zones in your MIG). With opportunistic updates you can achieve what you are looking for. Currently we implement it the following way:

  • Set the instance template of the MIG to the new instance template created from the new image:
gcloud compute instance-groups managed set-instance-template ${instanceName}-group \
    --template=${instanceName}-${tag} \
    --region=${region}
  • Run a for loop and update each VM with the new template. Google provides a command that pauses the execution of the script until the MIG is stable, which ensures you are not applying the update to another VM until the current instance is stable:
for (( i = 1; i <= $number_of_nodes; i++ ))
    do
        echo "Trying to update Kafka Node${i} with new instance template ${instanceName}-${tag}"
        (set -x
            gcloud compute instance-groups managed update-instances ${instanceName}-group \
            --instances=${instanceName}-kafka-node${i} \
            --region=${region}
        )
        echo "Checking for MIG stability"
        (set -x
            gcloud compute instance-groups managed wait-until ${instanceName}-group \
            --stable \
            --region=${region}
        )
    done
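
For completeness, the loop above assumes a few shell variables are set beforehand, for example (all values are placeholders):

instanceName="..."      # base name shared by the template, the MIG (${instanceName}-group) and the nodes
tag="..."               # tag of the newly built image / instance template
region="..."            # region of the regional MIG
number_of_nodes=3       # number of kafka-node instances managed by the group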

You can have a look at this documentation.
