
For those who don't want to read the whole story:

TL;DR: When using TF Estimator, do we have to scale the learning rate by the factor by which we increase the batch size (I know this is the right thing to do; I am just not sure whether TF handles it internally)? Similarly, do we have to scale the per-example loss by the global batch size (batch_size_per_replica * number of replicas)?

The documentation on TensorFlow distributed training is confusing, and I need clarification on the points below.

  1. It is now understood that if you increase the batch size by a factor of k, you also need to increase the learning rate by k (see this and this paper). However, the official TensorFlow page on distributed training makes no clarifying comment about this. They do mention here that the learning rate may need to be adjusted. Do they handle the learning rate scaling themselves? To make matters more complicated, the behavior is different in Keras and tf.Estimator (see the next point). Any suggestions on whether I should increase the LR by a factor of k when using tf.Estimator?

  2. It is widely accepted that the per-example loss should be scaled by global_batch_size = batch_size_per_replica * number of replicas. TensorFlow mentions this here, but when illustrating how to achieve it with a tf.Estimator, they either forget to do so or the scaling by global_batch_size is not required. See here: in the code snippet, the loss is defined as follows.

loss = tf.reduce_sum(loss) * (1. / BATCH_SIZE)

and BATCH_SIZE, to the best of my understanding, is defined above as the per-replica batch size (see the sketch below for what I mean by the discrepancy).

To complicate things further, the scaling is handled automatically if you are using Keras (for reasons I will never understand; it would have been better to keep everything consistent).
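To make the discrepancy concrete, here is a minimal sketch of the two scalings I am talking about (TF 2.x; the constants and the use of tf.nn.compute_average_loss are my own illustration, not taken from the tutorial):

import tensorflow as tf

BATCH_SIZE_PER_REPLICA = 64
NUM_REPLICAS = 4  # e.g. 4 GPUs
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * NUM_REPLICAS

# Stand-in for the per-example losses computed on one replica.
per_example_loss = tf.random.uniform([BATCH_SIZE_PER_REPLICA])

# What the tutorial snippet appears to do: divide by the per-replica batch size.
loss_per_replica_scaled = tf.reduce_sum(per_example_loss) * (1. / BATCH_SIZE_PER_REPLICA)

# What the "correct loss scaling" guidance says: divide by the global batch size.
loss_globally_scaled = tf.reduce_sum(per_example_loss) * (1. / GLOBAL_BATCH_SIZE)

# TF 2.x also has a helper that performs the division by the global batch size.
loss_helper = tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

My question is whether the second (global) scaling is the one I should write in my tf.Estimator model_fn, and whether I should also multiply the LR by the number of replicas myself.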

Autonomous
  • I don't know about distributed, but non-distributed is indeed scaled, with source code I could link. You could, however, verify this scaling fairly easily yourself: 1) fix all random seeds; 2) feed N _same_ samples, record changes in weights; 3) restart training (restart Python kernel); 4) feed 2 * N _same_ samples, record changes in weights. If (2) == (4), it's auto-scaled (a sketch of this check is included below). – OverLordGoldDragon May 27 '20 at 21:35
  • Alternatively, a minimally-reproducible code would be easier to inspect. – OverLordGoldDragon May 27 '20 at 21:36
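A rough sketch of the check OverLordGoldDragon describes, for the non-distributed Keras case; the toy model, the constant initializer (used instead of restarting the kernel, so both runs start from identical weights) and the batch sizes are my own placeholders:

import numpy as np
import tensorflow as tf

def weight_delta_after_one_step(batch_size, lr=0.1):
    # Train for one step on `batch_size` copies of the same sample and
    # return how much the kernel of a tiny model changed.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            1, input_shape=(4,),
            kernel_initializer=tf.keras.initializers.Constant(0.5),
            bias_initializer="zeros"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr), loss="mse")
    before = model.layers[0].kernel.numpy().copy()
    x = np.ones((batch_size, 4), dtype="float32")  # N identical samples
    y = np.ones((batch_size, 1), dtype="float32")
    model.train_on_batch(x, y)
    return model.layers[0].kernel.numpy() - before

delta_n = weight_delta_after_one_step(batch_size=8)    # N samples
delta_2n = weight_delta_after_one_step(batch_size=16)  # 2 * N samples
# If the loss is averaged over the batch (i.e. auto-scaled), the two updates match.
print(np.allclose(delta_n, delta_2n))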

1 Answer

  1. The learning rate is not scaled automatically for you. As you said, they do suggest that you might need to adjust the learning rate, but only in some cases, so that is not the default behavior. I suggest you increase the learning rate manually.

  2. If we take a look at a simple tf.Estimator, the tf.estimator.DNNClassifier (link), the default loss_reduction is losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE. If we go to that Reduction (found here), we see that it is a policy for how to combine losses from individual samples. On a single machine we would just use tf.reduce_mean, but you can't use that in a distributed setting (as mentioned in the next link). The Reduction leads us here, which 1) shows an implementation of how you would compute the loss scaled by the global batch size and 2) explains why. As they are telling you to implement this yourself, it is not handled by tf.Estimator. Also note that on the Reduction page they state the differences between Keras and Estimators regarding these parameters. A sketch of both adjustments in an Estimator model_fn follows below.
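Here is a minimal sketch of what I mean, assuming TF 2.x with MirroredStrategy; the toy model, BASE_LR and the batch-size constants are placeholders of mine, not something taken from the linked pages:

import tensorflow as tf

BASE_LR = 0.01               # learning rate tuned for a single replica (placeholder)
BATCH_SIZE_PER_REPLICA = 64  # placeholder

strategy = tf.distribute.MirroredStrategy()
num_replicas = strategy.num_replicas_in_sync
global_batch_size = BATCH_SIZE_PER_REPLICA * num_replicas

def model_fn(features, labels, mode):
    logits = tf.keras.layers.Dense(10)(features)  # toy model
    per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    # Scale the summed per-example loss by the *global* batch size,
    # not by the per-replica batch size.
    loss = tf.reduce_sum(per_example_loss) * (1. / global_batch_size)

    # Linear scaling rule: multiply the single-replica LR by the number of
    # replicas ourselves, since the Estimator does not do it for us.
    optimizer = tf.compat.v1.train.GradientDescentOptimizer(
        learning_rate=BASE_LR * num_replicas)
    train_op = optimizer.minimize(
        loss, global_step=tf.compat.v1.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

GradientDescentOptimizer is only used to keep the example short; the two scalings apply in the same way whichever optimizer you pick.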

Frederik Bode
  • To summarize: 1. We should modify the LR ourselves. 2. Losses should be scaled by the global batch size, since the default reduction is to add up losses from different machines. Regarding point 2, do you know of, or can you point me to, a code snippet that mentions the reduction policy for `MultiWorkerMirroredStrategy`? They keep saying `RING` and `NCCL`, but I couldn't find an explicit implementation of that policy. – Autonomous Jun 02 '20 at 18:04
  • If I scale the loss by (1/GlobalBatchSize), the overall distributed loss becomes 1/x when I train on x GPUs. I posted this issue [here](https://stackoverflow.com/questions/71555766/distributed-training-with-tensorflow-on-x-gpu-makes-loss-1-x). – steinum Mar 22 '22 at 10:27