2

I just converted an existing project from TF 1.14 to TF 2.1 which uses the TPUEstimator API. After making the conversion, testing locally (i.e. use_tpu=False) runs successfully. However, I am getting errors when running on Google Cloud TPU (i.e. use_tpu=True).

Note: This is in the context of the AdaNet AutoML framework (v0.8.0), although I suspect this may be a general TPUEstimator-related error, as the errors appear to originate in the tpu_estimator.py and error_handling.py scripts seen in the Traceback below:

  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3032, in train
    rendezvous.record_error('training_loop', sys.exc_info())
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 81, in record_error
    if value and value.op and value.op.type == _CHECK_NUMERIC_OP_NAME:
  AttributeError: 'RuntimeError' object has no attribute 'op'

  During handling of the above exception, another exception occurred:  

  File "workspace/trainer/train.py", line 331, in <module>
    main(args=parsed_args)
  File "workspace/trainer/train.py", line 177, in main
    run_config=run_config)
  File "workspace/trainer/train.py", line 68, in run_experiment
    estimator.train(input_fn=train_input_fn, max_steps=total_train_steps)
  File "/usr/local/lib/python3.6/site-packages/adanet/core/estimator.py", line 853, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3186, in _model_fn
    host_ops = host_call.create_tpu_hostcall()
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2226, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:1", shape=(), dtype=int64, device=/job:tpu_worker/task:0/device:CPU:0)'

The previous version of the project using TF 1.14 runs both locally and on TPU using TPUEstimator without issues. Is there something obvious I am potentially missing for the conversion over to TF 2.1 when using TPUEstimator API?

1 Answers1

0

Have you applied the following:

dataset = ...
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))

this potentially drops the last few samples from a file to ensure that every batch has a static shape of batch_size, which is required when training on TPUs.

Jan Krynauw
  • 1,042
  • 10
  • 21
  • I have indeed taken measures to ensure a static batch_size as per requirements on TPU. Previously, I applied this using the `dataset = (dataset.apply(tf.data.experimental.map_and_batch(...,batch_size, drop_remainder=True))` As part of the update I have moved to the using `dataset = dataset.map(...).batch(batch_size, drop_remainder=True)` as the fused map_and_batch function is considered deprecated. – Nicholas Breckwoldt Jun 11 '20 at 07:41