
My batch size is 512 and I have 8 GPUs.

Should I define: rescale_grad = 1. / 512 or rescale_grad = 1. / (8*512)

Thanks!

Jenia Golbstein
  • Increasing the batch size does not guarantee that a larger learning rate will work. However you can check out this recent Facebook paper https://arxiv.org/pdf/1706.02677.pdf for some strategies. – leezu Oct 04 '17 at 03:12

1 Answer


In MXNet, the batch size is defined globally, not per GPU: the batch you feed in is split across the available devices. Quote (from here):

Workload Partitioning

By default, MXNet partitions a data batch evenly among the available GPUs. Assume a batch size b and assume there are k GPUs, then in one iteration each GPU will perform forward and backward on b/k examples. The gradients are then summed over all GPUs before updating the model.

In your case b is 512: each of the 8 GPUs processes 512/8 = 64 examples, and the gradients are summed across GPUs before the update, so the sum already covers all 512 examples. Therefore you should use rescale_grad = 1. / 512.
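A toy NumPy sketch (not MXNet code, just an illustration of the arithmetic) of why 1/b is the right factor: if each of k devices computes the *sum* of per-example gradients over its b/k shard, then summing across devices and dividing by b gives exactly the single-device mean gradient over the full batch.

```python
import numpy as np

# Toy model: gradient of sum of 0.5*(x.w - y)^2 over a set of examples.
rng = np.random.default_rng(0)
b, k, d = 512, 8, 4                  # global batch, "GPUs", feature dim
X = rng.normal(size=(b, d))
y = rng.normal(size=b)
w = rng.normal(size=d)

def summed_grad(Xp, yp, w):
    # Sum (not mean) of per-example gradients on one shard.
    return Xp.T @ (Xp @ w - yp)

# Single device: mean gradient over all b examples.
g_single = summed_grad(X, y, w) / b

# k devices: each handles b/k examples; per-device gradient sums are
# added together, then rescaled by 1/b (not 1/(k*b)).
shards = np.split(np.arange(b), k)
g_multi = sum(summed_grad(X[s], y[s], w) for s in shards) / b

assert np.allclose(g_single, g_multi)
```

Using 1/(8*512) instead would shrink every gradient by an extra factor of 8, which is equivalent to silently dividing your learning rate by 8.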