
I'm running custom training jobs in Google's Vertex AI. A simple gcloud command to execute a custom job uses syntax like the following (complete documentation for the command can be seen here):

gcloud beta ai custom-jobs create --region=us-central1 \
--display-name=test \
--config=config.yaml

In the config.yaml file, it is possible to specify the machine and accelerator (GPU) types, etc., and in my case, point to a custom container living in the Google Artifact Registry that executes the training code (specified in the imageUri part of the containerSpec). An example config file may look like this:

# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
    acceleratorType: NVIDIA_TESLA_P100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: {URI_FOR_CUSTOM_CONTAINER}
    args:
    - {ARGS TO PASS TO CONTAINER ENTRYPOINT COMMAND}

The code we're running needs some runtime environment variables (which need to be kept secure) passed to the container. The API documentation for the containerSpec says it is possible to set environment variables as follows:

# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
    acceleratorType: NVIDIA_TESLA_P100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: {URI_FOR_CUSTOM_CONTAINER}
    args:
    - {ARGS TO PASS TO CONTAINER ENTRYPOINT COMMAND}
    env:
    - name: SECRET_ONE
      value: $SECRET_ONE
    - name: SECRET_TWO
      value: $SECRET_TWO

When I try to add the env field to the containerSpec, I get an error saying it's not part of the container spec:

ERROR: (gcloud.beta.ai.custom-jobs.create) INVALID_ARGUMENT: Invalid JSON payload received. Unknown name "env" at 'custom_job.job_spec.worker_pool_specs[0].container_spec': Cannot find field.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: "Invalid JSON payload received. Unknown name \"env\" at 'custom_job.job_spec.worker_pool_specs[0].container_spec':\
      \ Cannot find field."
    field: custom_job.job_spec.worker_pool_specs[0].container_spec

Any idea how to securely set runtime environment variables in Vertex AI custom jobs using custom containers?

JmeCS
  • I think it's a bug either in the [ContainerSpec](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#ContainerSpec) documentation or its implementation. Your approach is correct and you ought to be able to define environment variables as you are doing. I recommend filing a bug on Google's [Issue Tracker](https://issuetracker.google.com) for [Cloud Machine Learning Engine](https://issuetracker.google.com/issues?q=componentid:187220) – DazWilkin Sep 23 '21 at 15:38
  • Thank you I'll give that a shot – JmeCS Sep 23 '21 at 16:06
  • You're welcome! For posterity: https://issuetracker.google.com/issues/200923643 – DazWilkin Sep 23 '21 at 23:49
  • @JmeCS Can you try the `gcloud` command without the `beta` parameter? There are two versions of the REST API - [v1](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#containerspec) and [v1beta1](https://cloud.google.com/vertex-ai/docs/reference/rest/v1beta1/CustomJobSpec#containerspec) where "v1beta1" does not have the `env` option in `ContainerSpec` but "v1" does. The `gcloud ai custom-jobs create` does not throw the error. – Kabilan Mohanraj Sep 24 '21 at 07:05
  • @KabilanMohanraj - Thanks I had no idea that there was a non-beta version of the API. You are right that the "v1" spec does not throw an error! Are you aware if it is possible to grab environment variables from the machine that the job is started from? I've tried setting values in the yaml like: ${var} but it doesn't work – JmeCS Sep 24 '21 at 19:09
  • can you make an example on how to pass args? – Galuoises Jul 28 '22 at 14:29
  • @Galuoises under the `containerSpec` add: `args:` `- 'your arg here'` – JmeCS Aug 01 '22 at 20:10

1 Answer


There are two versions of the REST API, v1 and v1beta1. The v1beta1 version does not have the env option in ContainerSpec, but v1 does. The gcloud ai custom-jobs create command without the beta component doesn't throw the error, because it uses v1 to make the API calls.
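In other words, the same job can be submitted through the v1 surface simply by dropping beta from the original command:

gcloud ai custom-jobs create --region=us-central1 \
--display-name=test \
--config=config.yaml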

The environment variables from the yaml file can be passed to the custom container in the following way:

This is the Dockerfile of the sample custom training application I used to test the requirement. Please refer to this codelab for more information about the training application.

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3

WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Copies the bash script to the docker image.
COPY commands.sh /scripts/commands.sh

# Makes the script executable.
RUN ["chmod", "+x", "/scripts/commands.sh"]

# Runs the script when the container starts.
ENTRYPOINT ["/scripts/commands.sh"]

# Note: if you are not using a bash script, the trainer can be invoked
# directly from the docker ENTRYPOINT and the environment variable can be
# used there, e.g. ENTRYPOINT "python" "-m" $SECRET_TWO
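For reference, a minimal sketch of that variant (no bash script), assuming a shell-form ENTRYPOINT so that $SECRET_TWO is expanded by the container's shell at runtime:

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3
WORKDIR /
# Copies the trainer code to the docker image.
COPY trainer /trainer
# Shell form, so $SECRET_TWO is resolved when the container starts.
ENTRYPOINT python -m $SECRET_TWO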

Below is the commands.sh file used in the docker container to test whether the environment variables are passed to the container.

#!/bin/bash
mkdir /root/.ssh
echo $SECRET_ONE
python -m $SECRET_TWO

The example config.yaml file:

# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/infosys-kabilan/mpg:v1
    env:
    - name: SECRET_ONE
      value: "Passing the environment variables"
    - name: SECRET_TWO
      value: "trainer.train"

As the next step, I built and pushed the container to Google Container Registry. Now, running gcloud ai custom-jobs create --region=us-central1 --display-name=test --config=config.yaml creates the custom training job, and the output of the commands.sh file can be seen in the job logs, as shown below.

[Screenshot: the custom job logs showing the output of commands.sh]
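Roughly, the end-to-end steps look like this (a sketch; the image name comes from the config above, and JOB_ID stands for the job ID printed by the create command):

# Build and push the training image referenced in config.yaml.
docker build -t gcr.io/infosys-kabilan/mpg:v1 .
docker push gcr.io/infosys-kabilan/mpg:v1

# Create the custom job and stream its logs.
gcloud ai custom-jobs create --region=us-central1 \
--display-name=test --config=config.yaml
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1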

Kabilan Mohanraj