
When training with MXNet, suppose the desired batch size is large (say 128), the number of GPUs is small (say 2), and each GPU can only handle a few samples per iteration (say 16). By default, the maximum batch size in this configuration is 16 * 2 = 32.

In theory, we could run 4 forward/backward passes before updating the weights, making the effective batch size 128. Is this possible with MXNet?

Targo

1 Answer


Here is a streamlined (memory-wise) approach: configure each parameter to accumulate gradients, run your 4 forward and backward passes, call Trainer.step(), and then manually zero the gradients.

As per https://discuss.mxnet.io/t/aggregate-gradients-manually-over-n-batches/504/2

"This is very straightforward to do with Gluon. You need to set the grad_req in your network Parameter instances to 'add' and manually set the gradient to zero using zero_grad() after each Trainer.step() (see here). To set grad_req to 'add':

for p in net.collect_params().values():
    p.grad_req = 'add'

"And similarly call zero_grad() on each parameter after calling Trainer.step(). Remember to modify batch_size argument of trainer.step() accordingly."

Vishaal