19

I have a Docker container that performs a single large computation. This computation requires lots of memory and takes about 12 hours to run.

I can create a Google Compute Engine VM of the appropriate size and use the "Deploy a container image to this VM instance" option to run this job perfectly. However once the job is finished the container quits but the VM is still running (and charging).

How can I make the VM exit/stop/delete when the container exits?

When the VM is in this zombie mode, only the Stackdriver containers are left running:

$ docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
bfa2feb03180        gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1   "/entrypoint.sh /u..."   17 hours ago        Up 17 hours                             stackdriver-logging-agent
161439a487c2        gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2    "/bin/sh -c /opt/s..."   17 hours ago        Up 17 hours         8000/tcp            stackdriver-metadata-agent

I create the VM like this:

gcloud beta compute --project=abc instances create-with-container vm-name \
                    --zone=us-central1-c --machine-type=custom-1-65536-ext \
                    --network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
                    --maintenance-policy=MIGRATE \
                    --service-account=xyz \
                    --scopes=https://www.googleapis.com/auth/cloud-platform \
                    --image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
                    --boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
                    --container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
                    --container-command=python3 \
                    --container-arg="a" --container-arg="b" --container-arg="c" \
                    --labels=container-vm=cos-stable-69-10895-71-0
Adam

6 Answers

17

When you create the VM, you'll need to give it write access to compute so you can delete the instance from within. You should also set container environment variables like gce_zone and gce_project_id at this time. You'll need them to delete the instance.

gcloud beta compute instances create-with-container {NAME} \
    --container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
    --service-account={SERVICE_ACCOUNT} \
    --scopes=https://www.googleapis.com/auth/compute,...
    ...

Then within the container, whenever you determine your task is finished:

  1. Request an API token (I'm using curl for simplicity and the default GCE service account):
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"

This will respond with JSON that looks like:

{
  "access_token": "foobarbaz...",
  "expires_in": 1234,
  "token_type": "Bearer"
}
  2. Take that access token and hit the instances.delete API endpoint (notice the environment variables):
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME
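If jq isn't available in the container, the token can also be pulled out of that JSON response with a small sed expression; a minimal sketch against a hard-coded sample response (the token value is made up):

```shell
# Sample token response as returned by the metadata server (value is fake).
RESPONSE='{"access_token": "foobarbaz...", "expires_in": 1234, "token_type": "Bearer"}'

# Extract the access_token field without jq.
TOKEN=$(printf '%s' "$RESPONSE" | sed -n 's/.*"access_token": *"\([^"]*\)".*/\1/p')
echo "$TOKEN"   # foobarbaz...
```

In a real container you would pipe the curl output above straight into the sed command instead of using a hard-coded string.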
Vincent
  • Do you know the deal with `--scopes=https://www.googleapis.com/auth/compute`? My scheme only works if I *omit* that. When I include it the VM's console has this error `Error: Failed to start container: Error response from daemon: {"message":"unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication"}` – Adam Oct 15 '18 at 06:49
    @Adam thats strange. Well I suppose as long as you have it working, thats the way to go :P. – Vincent Oct 15 '18 at 07:51
  • how do you set gce_zone and gce_project_id ? – mcmillab Apr 06 '21 at 18:07
    @mcmillab use the `--container-env` arg like the answer does. – Vincent Apr 06 '21 at 18:10
    Anyone else hit a `Request had insufficient authentication scopes` error when trying this? – ateymour Dec 24 '21 at 01:05
  • The deal with scopes is if you pass in only the `https://***/compute` scope you don't get all the default scopes and you can't pull a container. To solve the issue create your container without specifying scope, then describe it, then add compute to the list of scope. Use: `gcloud beta compute instances describe CONTAINER_NAME` to get the default scopes. – mooli Aug 10 '22 at 06:51
12

Having grappled with the problem for some time, here's a full solution that works pretty well.

This solution doesn't use the "start machine with a container image" option. Instead it uses a startup script, which is more flexible. You still use a Container-Optimized OS instance.

  1. Create a startup script:
#!/usr/bin/env bash

# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")

CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")

# This is needed if you are using a private image in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker

# Run! The logs will go to Stackdriver.
sudo HOME=/home/root docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}

# Get the zone
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
ZONE="${zoneMetadataSplit[3]}"

# Run compute delete on the current instance. Need to run in a container 
# because COS machines don't come with gcloud installed 
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME}  --delete-disks=all --zone=${ZONE}
  2. Put the script somewhere public. For example, put it on Cloud Storage and create a public URL. You can't use a gs:// URI for a COS startup script.

  3. Start an instance using a startup-script-url, passing the image name and parameters, e.g.:

gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME  \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME

(You probably want to limit the scopes; the example uses full access for simplicity.)
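As a side note, the IFS-based zone split in the startup script can be replaced by a single parameter expansion, which avoids mutating IFS; a minimal sketch with a hard-coded sample value (a real script would fetch it from the metadata server as above):

```shell
# The zone metadata endpoint returns a path like this (sample project number):
zoneMetadata="projects/123456789/zones/us-central1-c"

# Strip everything up to and including the last "/" to get the bare zone name.
ZONE="${zoneMetadata##*/}"
echo "$ZONE"   # us-central1-c
```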

daphshez
    This worked for me! And it's fully automated as opposed to the other solutions. Step 2 isn't necessary by the way: instead of specifying `startup-script-url` in the `metadata` argument, you can specify a local path to the script with: `--metadata-from-file "startup-script=path/to/startup-script.sh"` [Google Cloud docs: using a local startup script](https://cloud.google.com/compute/docs/startupscript#using-a-local-startup-script-file) – Julian Ferry Aug 09 '20 at 19:48
  • What's the best way to kick this off on a timer? Use cloud scheduler to run that script in a container on cloud run? – Robert Moskal Oct 20 '21 at 03:29
10

I wrote a self-contained Python function based on Vincent's answer.

def kill_vm():
    """
    If we are running inside a GCE VM, kill it.
    """
    # based on https://stackoverflow.com/q/52748332/321772
    import json
    import logging
    import requests

    # get the token
    r = json.loads(
        requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
                     headers={"Metadata-Flavor": "Google"})
            .text)

    token = r["access_token"]

    # get instance metadata
    # based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
    project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
                              headers={"Metadata-Flavor": "Google"}).text

    name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
                        headers={"Metadata-Flavor": "Google"}).text

    zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
                             headers={"Metadata-Flavor": "Google"}).text
    zone = zone_long.split("/")[-1]

    # shut ourselves down
    logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))

    requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
                    .format(project_id=project_id, zone=zone, name=name),
                    headers={"Authorization": "Bearer {token}".format(token=token)})

A simple atexit hook gets me my desired behavior:

import atexit
atexit.register(kill_vm)
Adam
  • This is fine, but I think it's best to avoid the metadata server as much as possible. $HOSTNAME should already be set on the instance and, as said in my answer, it may be prudent to set the zone/project id as container env variables when you create the VM so you don't have to fetch them from the metadata server. Anyway, that's my two cents. Nice implementation. – Vincent Oct 15 '18 at 08:09
    Out of curiosity, why do you prefer to avoid the metadata server? – Adam Oct 15 '18 at 16:09
  • Well, I think it's always better to avoid going over the wire when you can. That's all. – Vincent Oct 15 '18 at 17:09
    Understandable. I suspect these calls are serviced by the local host, or something extremely close to it, as the return value comes back instantaneously. Besides, this is only 3 calls on VM shutdown, and the other 2 are unavoidable anyway. So I'm ok with it :) The upside is that the method is self-contained and doesn't require care during deployment. – Adam Oct 15 '18 at 20:18
  • @Adam You could also request DELETE API with `zone_long` and `name` to reduce the metadata requests. – northtree May 08 '19 at 01:03
  • @northtree which metadata request would that eliminate? I already fetch `zone_long` and `name`. – Adam May 08 '19 at 05:19
  • @Adam You don't have to request `project_id`. – northtree May 08 '19 at 05:23
    what would the new URL be? – Adam May 08 '19 at 16:38
1

Another solution is to not use GCE and instead use AI Platform's custom job service, which automatically shuts down the VM after the Docker container exits.

gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --master-image-uri $IMAGE_URI

You can specify --master-machine-type.

See the GCP documentation on custom containers.

Tom Phillips
1

The simplest way, from within the container, once it's finished:

ZONE=$(gcloud compute instances list --filter="name=($HOSTNAME)" --format 'csv[no-heading](zone)')

gcloud compute instances delete $HOSTNAME --zone=$ZONE -q

-q skips the interactive confirmation

$HOSTNAME is already exported

rso
1

Just use curl and the local metadata server (no need for Python scripts or gcloud). Add the following to the end of your Docker Entrypoint script, so it's run when the container finishes:

# Note: inside the container the name is exposed as $HOSTNAME
INSTANCE_NAME=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
INSTANCE_ZONE=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google")

echo "Terminating instance [${INSTANCE_NAME}] in zone [${INSTANCE_ZONE}]"
TOKEN=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google" | jq -r '.access_token')
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" https://www.googleapis.com/compute/v1/$INSTANCE_ZONE/instances/$INSTANCE_NAME
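Note that no separate project lookup is needed here: the instance/zone metadata value is a full path of the form projects/PROJECT_NUMBER/zones/ZONE, so the delete URL can be built by plain concatenation. A sketch with made-up sample values:

```shell
# Sample values as the metadata server would return them (made up).
INSTANCE_ZONE="projects/123456789/zones/us-central1-c"
INSTANCE_NAME="my-vm"

# The zone path already carries the project, so concatenation is enough.
echo "https://www.googleapis.com/compute/v1/$INSTANCE_ZONE/instances/$INSTANCE_NAME"
```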

For security's sake, and the principle of least privilege, run the VM with a custom service account and give that service account a role with this permission (a custom role is best):

compute.instances.delete
Joseph Lust