
When training with MXNet, suppose the desired batch size is large (say 128), the number of GPUs is small (say 2), and each GPU can only handle a few samples per iteration (say 16). By default, the maximum batch size in this configuration is 16 * 2 = 32.

In theory, we could run 4 forward/backward passes before updating the weights, making the effective batch size 128. Is this possible with MXNet?

Targo

1 Answer


Here is a streamlined (memory-wise) approach: configure each parameter to accumulate gradients, run your 4 forward and backward passes, call Trainer.step(), and then manually zero the gradients.

As per https://discuss.mxnet.io/t/aggregate-gradients-manually-over-n-batches/504/2

"This is very straightforward to do with Gluon. You need to set the grad_req in your network Parameter instances to 'add' and manually set the gradient to zero using zero_grad() after each Trainer.step() (see here). To set grad_req to 'add':

for p in net.collect_params().values():
    p.grad_req = 'add'

"And similarly call zero_grad() on each parameter after calling Trainer.step(). Remember to modify batch_size argument of trainer.step() accordingly."

Vishaal