
I am trying to run a training job on Google Cloud ML Engine. I am submitting the job using

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_tpu_main \
--runtime-version 1.13 \
--scale-tier BASIC_TPU \
--region us-central1 \
-- \
--model_dir=gs://${YOUR_GCS_BUCKET}/train \
--tpu_zone us-central1 \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/pipeline.config
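
For context, the command above assumes `YOUR_GCS_BUCKET` is already exported in the shell; a minimal sketch (the bucket name below is a placeholder, not taken from the question):

```shell
# Placeholder bucket name; substitute your own GCS bucket (without the gs:// prefix).
export YOUR_GCS_BUCKET="my-object-detection-bucket"

# This is the job name the backticked expressions generate,
# e.g. alice_object_detection_1555555555:
echo "$(whoami)_object_detection_$(date +%s)"
```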

However, after the job is created and all the required packages are installed, I start to repeatedly get these messages:

[screenshot: repeated log messages]

until the job fails with this output:

[screenshot: job failure output]

I have already tried this, this and this without any success.

I suppose the problem is related to authentication, so I followed this tutorial, but that didn't help.

Any help is very appreciated!

Paktalin

1 Answer


It seems there were issues with TPU allocation. I worked around the problem by switching from TPU to GPU, so the job-submission command becomes:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--runtime-version 1.13 \
--scale-tier BASIC_GPU \
--region us-central1 \
-- \
--model_dir=gs://${YOUR_GCS_BUCKET}/train \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/pipeline.config
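
For clarity, only three things change relative to the TPU command: the module name, the scale tier, and the dropped `--tpu_zone` flag. A small sketch echoing the changed flags (values taken directly from the two commands above):

```shell
# TPU -> GPU: the module and scale tier change; --tpu_zone is removed entirely.
MODULE="object_detection.model_main"   # was object_detection.model_tpu_main
TIER="BASIC_GPU"                       # was BASIC_TPU
echo "--module-name $MODULE --scale-tier $TIER"
```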

UPDATE

I contacted @Yash Sonthalia as he asked me to, and the problem was fixed shortly after. Thanks!

  • This is not a solution. Why would someone want to change from TPU to GPU? There is a reason people pay money for TPUs, right? – ramgorur Mar 31 '21 at 01:46