Are TPUs supported for distributed hyperparameter search? I'm using the tensor2tensor
library, which supports CloudML for hyperparameter search; e.g., the following command successfully runs a hyperparameter search for a language model on GPUs:
t2t-trainer \
--model=transformer \
--hparams_set=transformer_tpu \
--problem=languagemodel_lm1b8k_packed \
--train_steps=100000 \
--eval_steps=8 \
--data_dir=$DATA_DIR \
--output_dir=$OUT_DIR \
--cloud_mlengine \
--hparams_range=transformer_base_range \
--autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
--autotune_maximize \
--autotune_max_trials=100 \
--autotune_parallel_trials=3
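For context, --hparams_range points at a ranged hyperparameter set registered with tensor2tensor, and CloudML's tuning service samples trials from those ranges. A minimal sketch of the registration pattern, using the standard t2t registry API (the exact ranges in the library's transformer_base_range may differ; this only shows the shape of such a definition):

from tensor2tensor.utils import registry

@registry.register_ranged_hparams
def transformer_base_range(rhp):
  """Ranges that the CloudML hyperparameter tuner samples from."""
  # Search the learning rate on a log scale.
  rhp.set_float("learning_rate", 0.3, 3.0, scale=rhp.LOG_SCALE)
  # Pick warmup steps from a discrete set of values.
  rhp.set_discrete("learning_rate_warmup_steps", [1000, 2000, 4000, 8000])
  rhp.set_float("optimizer_adam_beta1", 0.85, 0.95)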
However, when I try to use TPUs with the following command:
t2t-trainer \
--problem=languagemodel_lm1b8k_packed \
--model=transformer \
--hparams_set=transformer_tpu \
--data_dir=$DATA_DIR \
--output_dir=$OUT_DIR \
--train_steps=100000 \
--use_tpu=True \
--cloud_mlengine_master_type=cloud_tpu \
--cloud_mlengine \
--hparams_range=transformer_base_range \
--autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
--autotune_maximize \
--autotune_max_trials=100 \
--autotune_parallel_trials=5
I get the error:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/******/jobs?alt=json returned "Field: master_type Error: The specified machine type for masteris not supported in TPU training jobs: cloud_tpu"
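My reading of the Cloud ML Engine docs (so this is an assumption about what the service expects, not about what t2t-trainer actually emits) is that TPU training jobs cannot use cloud_tpu as the master machine type: the master has to be an ordinary VM, and the TPU is attached as a worker. Roughly, the trainingInput portion of the jobs.create request body would need to look something like this (field names from the Cloud ML Engine v1 API; values are illustrative):

# Sketch of the relevant fields in the job request body.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "standard",    # a regular VM type; cloud_tpu is rejected here
    "workerType": "cloud_tpu",   # the TPU is requested as the worker
    "workerCount": 1,
    # ... plus packageUris, pythonModule, args, region, and the
    # hyperparameters spec that drives the tuning trials.
}

Is there a flag (or a supported configuration) that makes t2t-trainer build the request this way, or is combining TPUs with autotune simply not supported yet?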