
I am trying to run a training job on Google Cloud ML Engine. I am submitting the job using

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_tpu_main \
--runtime-version 1.13 \
--scale-tier BASIC_TPU \
--region us-central1 \
-- \
--model_dir=gs://${YOUR_GCS_BUCKET}/train \
--tpu_zone us-central1 \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/pipeline.config
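
For context, the command above assumes `YOUR_GCS_BUCKET` is already exported in the shell; a minimal sketch (the bucket name below is a placeholder, not taken from the question):

```shell
# Placeholder bucket name; substitute your own GCS bucket (without the gs:// prefix).
export YOUR_GCS_BUCKET="my-object-detection-bucket"

# This is the job name the backticked expressions generate,
# e.g. alice_object_detection_1555555555:
echo "$(whoami)_object_detection_$(date +%s)"
```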

However, after the job is created and all the required packages are installed, I start to repeatedly get these messages:

[screenshot: repeated log messages]

until the job fails with this output:

[screenshot: job failure output]

I have already tried this, this and this without any success.

I suppose the problem is related to authentication, so I followed this tutorial, but that didn't help.

Any help is very appreciated!

Paktalin

1 Answer


It seems there were issues with TPU allocation. I worked around the problem by switching from TPU to GPU, so the job-submission command becomes:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--runtime-version 1.13 \
--scale-tier BASIC_GPU \
--region us-central1 \
-- \
--model_dir=gs://${YOUR_GCS_BUCKET}/train \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/pipeline.config
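
For clarity, only three things change relative to the TPU command: the module name, the scale tier, and the dropped `--tpu_zone` flag. A small sketch echoing the changed flags (values taken directly from the two commands above):

```shell
# TPU -> GPU: the module and scale tier change; --tpu_zone is removed entirely.
MODULE="object_detection.model_main"   # was object_detection.model_tpu_main
TIER="BASIC_GPU"                       # was BASIC_TPU
echo "--module-name $MODULE --scale-tier $TIER"
```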

UPDATE

I contacted @Yash Sonthalia as he asked me to, and the problem was fixed shortly after. Thanks!

  • This is not a solution. Why would someone want to change from TPU to GPU? There is a reason people pay money for TPUs, right? – ramgorur Mar 31 '21 at 01:46