I am looking to train a model using Google Cloud's new service, the Unified AI Platform. To do so, I am using a config.yaml that looks like this:

workerPoolSpecs:
  workerPoolSpec:
    machineSpec:
      machineType: n1-highmem-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 1
    pythonPackageSpec:
      executorImageUri: us-docker.pkg.dev/cloud-aiplatform/training/tf-gpu.2-4:latest
      packageUris: gs://path/to/bucket/unified_ai_platform/src_dist/trainer-0.1.tar.gz
      pythonModule: trainer.task
  workerPoolSpec:
    machineSpec:
      machineType: n1-highmem-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 2
    pythonPackageSpec:
      executorImageUri: us-docker.pkg.dev/cloud-aiplatform/training/tf-gpu.2-4:latest
      packageUris: gs://path/to/bucket/unified_ai_platform/src_dist/trainer-0.1.tar.gz
      pythonModule: trainer.task

However, for distributed training I cannot work out how to pass multiple workerPoolSpecs in this file. The example YAML file provided does not cover the case of multiple workerPoolSpecs.

The documentation for the example also says: "You can specify multiple worker pool specs in order to create a custom job with multiple worker pools".

Any help in this regard will be appreciated.

Jash Shah
  • As per this [doc](https://cloud.google.com/ai-platform-unified/docs/training/configure-compute#specifying_gpus), the GPU that you choose must be available in the location where you are performing custom training. – Mahboob Mar 22 '21 at 18:33
  • @Mahboob the GPUs are okay. I wanted to know how to specify multiple worker pool specs in the config file. – Jash Shah Mar 22 '21 at 19:10
  • @JashShah are you getting an error message? If yes, make sure to add it to the question. – Ismail Mar 23 '21 at 17:13

1 Answer

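Answering my own question. workerPoolSpecs has to be a YAML list: each worker pool is its own list item, introduced by "-", rather than a repeated workerPoolSpec key. The config.yaml file should look like this: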

workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/path/to/container:v2
      args: 
        - --model-dir=gs://path/to/model
        - --tfrecord-dir=gs://path/to/training/data/
        - --epochs=2
  - machineSpec:
      machineType: n1-standard-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 2
    containerSpec:
      imageUri: gcr.io/path/to/container:v2
      args: 
        - --model-dir=gs://path/to/model
        - --tfrecord-dir=gs://path/to/training/data/
        - --epochs=2
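
The same list layout should also work with the pythonPackageSpec form from the question. This is a minimal sketch, not a tested config: it reuses the question's placeholder image, bucket path, and module name, and it assumes that packageUris expects a list of Cloud Storage URIs rather than a single string.

```yaml
# Sketch only: values are the placeholders from the question.
workerPoolSpecs:
  # First pool: a single replica.
  - machineSpec:
      machineType: n1-highmem-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 1
    pythonPackageSpec:
      executorImageUri: us-docker.pkg.dev/cloud-aiplatform/training/tf-gpu.2-4:latest
      # Assumption: packageUris is written as a list, even with one entry.
      packageUris:
        - gs://path/to/bucket/unified_ai_platform/src_dist/trainer-0.1.tar.gz
      pythonModule: trainer.task
  # Second pool: two replicas.
  - machineSpec:
      machineType: n1-highmem-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 2
    pythonPackageSpec:
      executorImageUri: us-docker.pkg.dev/cloud-aiplatform/training/tf-gpu.2-4:latest
      packageUris:
        - gs://path/to/bucket/unified_ai_platform/src_dist/trainer-0.1.tar.gz
      pythonModule: trainer.task
```

The config in the question most likely failed because it repeated the workerPoolSpec key inside one mapping. YAML does not allow duplicate keys in a mapping, and parsers typically either reject the file or keep only the last entry.
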
Jash Shah