Are TPUs supported for distributed hyperparameter search? I'm using the tensor2tensor
library, which supports CloudML for hyperparameter search; e.g., the following command successfully runs a hyperparameter search for a language model on GPUs:
t2t-trainer \
--model=transformer \
--hparams_set=transformer_tpu \
--problem=languagemodel_lm1b8k_packed \
--train_steps=100000 \
--eval_steps=8 \
--data_dir=$DATA_DIR \
--output_dir=$OUT_DIR \
--cloud_mlengine \
--hparams_range=transformer_base_range \
--autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
--autotune_maximize \
--autotune_max_trials=100 \
--autotune_parallel_trials=3
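For context, --hparams_range points at a ranged hyperparameter set registered with tensor2tensor, and CloudML's tuning service samples trials from those ranges. A minimal sketch of the registration pattern, using the standard t2t registry API (the exact ranges in the library's transformer_base_range may differ; this only shows the shape of such a definition):

from tensor2tensor.utils import registry

@registry.register_ranged_hparams
def transformer_base_range(rhp):
  """Ranges that the CloudML hyperparameter tuner samples from."""
  # Search the learning rate on a log scale.
  rhp.set_float("learning_rate", 0.3, 3.0, scale=rhp.LOG_SCALE)
  # Pick warmup steps from a discrete set of values.
  rhp.set_discrete("learning_rate_warmup_steps", [1000, 2000, 4000, 8000])
  rhp.set_float("optimizer_adam_beta1", 0.85, 0.95)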
However, when I try to use TPUs with the following command:
t2t-trainer \
--problem=languagemodel_lm1b8k_packed \
--model=transformer \
--hparams_set=transformer_tpu \
--data_dir=$DATA_DIR \
--output_dir=$OUT_DIR \
--train_steps=100000 \
--use_tpu=True \
--cloud_mlengine_master_type=cloud_tpu \
--cloud_mlengine \
--hparams_range=transformer_base_range \
--autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
--autotune_maximize \
--autotune_max_trials=100 \
--autotune_parallel_trials=5
I get the error:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/******/jobs?alt=json returned "Field: master_type Error: The specified machine type for masteris not supported in TPU training jobs: cloud_tpu"
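My reading of the Cloud ML Engine docs (so this is an assumption about what the service expects, not about what t2t-trainer actually emits) is that TPU training jobs cannot use cloud_tpu as the master machine type: the master has to be an ordinary VM, and the TPU is attached as a worker. Roughly, the trainingInput portion of the jobs.create request body would need to look something like this (field names from the Cloud ML Engine v1 API; values are illustrative):

# Sketch of the relevant fields in the job request body.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "standard",    # a regular VM type; cloud_tpu is rejected here
    "workerType": "cloud_tpu",   # the TPU is requested as the worker
    "workerCount": 1,
    # ... plus packageUris, pythonModule, args, region, and the
    # hyperparameters spec that drives the tuning trials.
}

Is there a flag (or a supported configuration) that makes t2t-trainer build the request this way, or is combining TPUs with autotune simply not supported yet?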