
I recently started a neural network project on Google Colab, and I discovered that I could use a TPU. I've been researching how to use it, and I found TensorFlow's TPUStrategy (I'm using TensorFlow 2.2.0); I've been able to successfully define the model and run a train step on the TPU.

However, I'm not exactly sure what that means. It might be that I didn't read Google's TPU guide thoroughly enough, but I don't know what exactly happens during a train step.

The guide asks you to define a GLOBAL_BATCH_SIZE, and the batch size that each TPU core takes is given by per_replica_batch_size = GLOBAL_BATCH_SIZE / strategy.num_replicas_in_sync, which means the batch size per core is smaller than the batch size you start with. On Colab, strategy.num_replicas_in_sync = 8, so if I start with a GLOBAL_BATCH_SIZE of 64, the per_replica_batch_size is 8.
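
For context, my setup looks roughly like this (the resolver lines follow the usual TF 2.2 Colab pattern):

```python
import tensorflow as tf

# Connect to the Colab TPU and build the strategy (TF 2.2 pattern).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

GLOBAL_BATCH_SIZE = 64
per_replica_batch_size = GLOBAL_BATCH_SIZE // strategy.num_replicas_in_sync
# On Colab: strategy.num_replicas_in_sync == 8, so per_replica_batch_size == 8
```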

Now, what I don't understand is whether, when I compute a train step, the optimizer computes 8 different steps on batches of size per_replica_batch_size, updating the weights of the model 8 different times, or whether it just parallelizes the computation of the train step this way and in the end computes only 1 optimizer step on a batch of size GLOBAL_BATCH_SIZE. Thanks.

spectraldoy

2 Answers


This is a good question, and it relates to how Distribution Strategies work.

After going through this TensorFlow documentation, the TPUStrategy documentation, and this explanation of Synchronous and Asynchronous Training, I can say that

> the optimizer computes 8 different steps on batches of size
> per_replica_batch_size, updating the weights of the model 8 different
> times

The explanation below, from the TensorFlow documentation, should clarify that:

> So, how should the loss be calculated when using a
> tf.distribute.Strategy?
> 
> For example, let's say you have 4 GPUs and a batch size of 64. One
> batch of input is distributed across the replicas (4 GPUs), each
> replica getting an input of size 16.
> 
> The model on each replica does a forward pass with its respective
> input and calculates the loss. Now, instead of dividing the loss by
> the number of examples in its respective input (BATCH_SIZE_PER_REPLICA
> = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64).
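
As a rough sketch of that scaling in a custom training step (names like `loss_object` and `GLOBAL_BATCH_SIZE` here are just the conventional ones from the tutorials, not anything you must use):

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64

# Reduction.NONE gives one loss value per example instead of a pre-averaged scalar.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, predictions):
    per_example_loss = loss_object(labels, predictions)
    # Divide by the GLOBAL batch size (64), not the per-replica size (16),
    # so the summed, all-reduced gradients match single-device training.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
```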

Providing the explanations from the other links below as well (just in case they stop working in the future):

The TPUStrategy documentation states:

> In terms of distributed training architecture, `TPUStrategy` is the
> same as `MirroredStrategy` - it implements `synchronous` distributed
> training. `TPUs` provide their own implementation of efficient
> `all-reduce` and other collective operations across multiple `TPU`
> cores, which are used in `TPUStrategy`.

The explanation of Synchronous and Asynchronous Training is shown below:

> `Synchronous vs asynchronous training`: These are two common ways of
> `distributing training` with `data parallelism`. In `sync training`, all
> `workers` train over different slices of input data in `sync`, and
> **`aggregate gradients`** at each step. In `async` training, all workers are
> independently training over the input data and updating variables
> `asynchronously`. Typically sync training is supported via all-reduce
> and `async` through parameter server architecture.

You can also go through this MPI Tutorial to understand the concept of All_Reduce in detail.
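
To make the idea concrete, here is a toy sketch of an all-reduce-by-sum in plain Python (real implementations use ring or tree algorithms across devices, but the result each replica sees is the same):

```python
def all_reduce_sum(per_replica_values):
    """Every replica ends up holding the sum of all replicas' values."""
    total = sum(per_replica_values)            # reduce
    return [total] * len(per_replica_values)   # broadcast back to each replica

# e.g. a scalar gradient computed independently on each of 4 replicas:
grads = [1.0, 3.0, -2.0, 4.0]
print(all_reduce_sum(grads))  # every replica now sees 6.0
```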

The screenshot below shows how All_Reduce works:

[Image: All_Reduce diagram]

  • Thank you for your answer. I am doing sync training, so the gradients are aggregated by All_Reduce, right? But doesn't that mean it's just updating the weights once with the summed gradients, rather than 8 times? Or does it mean the weights are updated 8 times with the same gradients? – spectraldoy Jul 23 '20 at 14:25
  • For the example of having 8 replicas and per_replica_batch_size = 8, I believe that the optimizer should update the weights only once, and only after the gradients from all replicas have been received. If it updated the weights 8 times, there would be different results compared with single-device training (1 replica). – Orwa kassab Nov 23 '21 at 13:21

If you only have a few cards to use, the GLOBAL_BATCH_SIZE parallel strategy can break the GPU memory boundaries and still let you use larger batch sizes.

It computes gradients 8 times, once per mini-batch of size per_replica_batch_size, but it updates the model variables only once, after the gradients for the whole global batch have been calculated.
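
A hedged sketch of what that looks like in a custom training step (assuming `strategy`, `model`, `optimizer`, a `loss_object` built with `Reduction.NONE`, and `GLOBAL_BATCH_SIZE` are already defined, as in the tf.distribute tutorials):

```python
@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        images, labels = inputs  # one per-replica slice of the global batch
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            # Scale by the global batch size so the combined update matches
            # single-device training on the full batch.
            loss = tf.nn.compute_average_loss(
                loss_object(labels, logits),
                global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        # Under a sync strategy, gradients are all-reduced across replicas
        # here, and the variables receive one combined update.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    # Sum the per-replica losses into one scalar for the global batch.
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
```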

[Image: gradient accumulation diagram]

Reference: How to Break GPU Memory Boundaries Even with Large Batch Sizes

alexqinbj