
I am trying to use TPUEstimator with train_and_evaluate() for an experiment on GCMLE. The TPUEstimator has a required argument train_batch_size that obviously specifies the batch size. However, for train_and_evaluate() I also specify a batch size through the TrainSpec:

train_input = lambda: input_fn(
    filenames = hparams.train_files,
    batch_size = hparams.train_batch_size,
    hparams = hparams,
    num_epochs = hparams.num_epochs, 
    shuffle=True,
    skip_header_lines=1
    )

train_spec = tf.estimator.TrainSpec(train_input, max_steps = hparams.train_steps)

estimator = tpu_estimator.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size = hparams.train_batch_size,
    eval_batch_size = hparams.eval_batch_size,
    )
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

In this example, consider that train_input within train_spec has its own batch_size specified (for something like tf.train.batch() or tf.data.Dataset.batch()), and that train_batch_size is also a required argument of TPUEstimator.

This seems very sloppy to me -- train_batch_size is passed in two different places. Is the recommendation just to make sure the same batch size is passed to both TPUEstimator and the TrainSpec? If the batch_size in TPUEstimator differed from the batch_size in the TrainSpec passed to train_and_evaluate(), which one would take precedence? Is there a better way to use train_and_evaluate() with a TPUEstimator that does not require passing the batch size in two different places?

Additionally, it appears that TPUEstimator automatically creates params['batch_size'], which according to the documentation is the "effective batch size". How does the effective batch size relate to train_batch_size? If my train_batch_size is 1024, is the "effective batch size" 128 (because of the 8 cores)?

reese0106

2 Answers


The batch size handling is slightly different between normal Estimator and TPUEstimator.

For a normal Estimator, the batch size is not visible to the Estimator at all; it is handled entirely inside the input_fn, as your example does.

For TPU, batch size is handled differently. Specifically, the xxx_batch_size family of arguments (e.g., train_batch_size) in the TPUEstimator constructor is the global batch size for your model. Depending on tf.contrib.tpu.TPUConfig.per_host_input_for_training, TPUEstimator invokes your input_fn in different ways.

Here, params['batch_size'] is the shard batch size, derived from the train_batch_size passed to the constructor.
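For illustration, here is a minimal input_fn sketch that reads the shard batch size from params['batch_size']. It assumes import tensorflow as tf under TF 1.x (1.10+ for drop_remainder); the file path and parse_fn are placeholders, not details from the question:

def input_fn(params):
    # TPUEstimator injects the per-shard (or per-host) batch size here.
    batch_size = params['batch_size']

    dataset = tf.data.TFRecordDataset('gs://my-bucket/train.tfrecord')  # placeholder path
    dataset = dataset.map(parse_fn)                                     # placeholder parsing function
    dataset = dataset.shuffle(buffer_size=10000).repeat()
    # drop_remainder=True keeps batch shapes static, which TPUs require.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset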

A concrete example: say train_batch_size is 64 and you are running on a Cloud TPU (8 cores).

  • If per_host_input_for_training is False, input_fn will be invoked 8 times on the Cloud TPU (this is called per-core mode). In this case, params['batch_size'] in input_fn is 64/8 = 8. The total global batch size your model sees is still 64, which is the train_batch_size passed to the TPUEstimator constructor above.

  • If per_host_input_for_training is set to True, params['batch_size'] in input_fn will be 64 (not 64/8) and input_fn will be called only once per host. The global batch size is still 64.

The same input_fn can work in both cases.
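For reference, a hedged sketch of how that flag might be set when building the run config; the iteration count, TPU endpoint, and model_dir below are placeholders rather than values from the question:

tpu_config = tf.contrib.tpu.TPUConfig(
    iterations_per_loop=100,            # placeholder value
    num_shards=8,                       # 8 cores on a single Cloud TPU
    per_host_input_for_training=True)   # per-host mode: one input_fn call per host

run_config = tf.contrib.tpu.RunConfig(
    master='grpc://10.240.1.2:8470',    # placeholder TPU endpoint
    model_dir='gs://my-bucket/model',   # placeholder output location
    tpu_config=tpu_config)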

For TPU Pods, the story is the same: params['batch_size'] is the shard batch size with respect to each host.

To summarize:

  1. The global batch size should be passed via the TPUEstimator constructor.

  2. The input_fn should take the shard batch size from params['batch_size'] and use it to build your dataset (see the sketch after this list).
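Putting both points together, here is a hedged sketch of the asker's setup with the batch size passed only once, via the constructor. The tf.contrib.tpu alias, eval_input_fn, and hparams.eval_steps are assumptions for illustration, not details from the question:

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=hparams.train_batch_size,   # global batch size, passed only here
    eval_batch_size=hparams.eval_batch_size)

def train_input_fn(params):
    # No batch size taken from TrainSpec; use the shard batch size
    # that TPUEstimator passes in via params.
    return input_fn(
        filenames=hparams.train_files,
        batch_size=params['batch_size'],
        hparams=hparams,
        num_epochs=hparams.num_epochs,
        shuffle=True,
        skip_header_lines=1)

train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=hparams.train_steps)
eval_spec = tf.estimator.EvalSpec(eval_input_fn, steps=hparams.eval_steps)   # placeholder eval pieces

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)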

Hope this helps.

J. Xie

You should call train and evaluate separately instead of train_and_evaluate. train_and_evaluate appears to try to set up a distributed cluster in a different way than train or evaluate do individually.
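For example, a hedged sketch of an explicit alternating loop; steps_per_eval, eval_steps, and the input_fn names are placeholders, not values from the question:

current_step = 0
steps_per_eval = 1000   # placeholder: train this many steps between evaluations
eval_steps = 100        # placeholder: number of evaluation batches

while current_step < hparams.train_steps:
    estimator.train(input_fn=train_input_fn, steps=steps_per_eval)
    current_step += steps_per_eval
    metrics = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
    tf.logging.info('Eval results at step %d: %s', current_step, metrics)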

lwz1992