I've recently started a neural network project on Google Colab and discovered that I can use a TPU. While researching how to use it I came across TensorFlow's TPUStrategy (I'm using TensorFlow 2.2.0), and I have been able to successfully define the model and run a train step on the TPU.
However, I'm not exactly sure what that actually means. It might be that I didn't read Google's TPU guide thoroughly enough, but I don't understand what exactly happens during a train step.
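For context, this is roughly how I set things up (it's essentially the standard Colab snippet for TF 2.2; the model and optimizer below are just stand-ins for my real ones):

```python
import os
import tensorflow as tf

# Standard Colab TPU initialisation for TF 2.x
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

print('Replicas:', strategy.num_replicas_in_sync)  # prints 8 on Colab

with strategy.scope():
    # Placeholder model and optimizer; my real model is defined here.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    optimizer = tf.keras.optimizers.Adam()
```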
The guide asks you to define a GLOBAL_BATCH_SIZE, and the batch size that each TPU core receives is per_replica_batch_size = GLOBAL_BATCH_SIZE / strategy.num_replicas_in_sync, which means that the batch size per TPU core is smaller than the batch size you start with. On Colab, strategy.num_replicas_in_sync = 8, so if I start with a GLOBAL_BATCH_SIZE of 64, the per_replica_batch_size is 8.
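Concretely, the batching part of my code looks roughly like this (continuing from the setup above; features and labels are stand-ins for my real data):

```python
GLOBAL_BATCH_SIZE = 64
per_replica_batch_size = GLOBAL_BATCH_SIZE // strategy.num_replicas_in_sync  # 64 // 8 = 8

# Placeholder data; my real dataset goes here. The dataset is batched with the
# GLOBAL batch size, and the strategy then splits each global batch of 64
# into 8 per-replica batches of 8, one per TPU core.
features = tf.random.normal([1024, 32])
labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)
```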
Now, what I don't understand is whether, when I compute a train step, the optimizer computes 8 different steps on batches of size per_replica_batch_size, updating the weights of the model 8 different times, or whether it just parallelizes the computation of the train step this way and in the end computes only 1 optimizer step on a batch of size GLOBAL_BATCH_SIZE. Thanks.
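In case it's useful, this is roughly the train step I'm running, following the custom-training pattern from the distributed training guide (it continues from the snippets above; the loss is just a stand-in):

```python
# Reduction.NONE so the averaging is done explicitly with the global batch size,
# as the guide recommends.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            per_example_loss = loss_object(y, logits)
            # Scale by the GLOBAL batch size, not the per-replica one.
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # Each replica runs step_fn on its own per-replica batch of size 8.
    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    train_step(batch)
```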