Edited this answer to describe a more streamlined (memory-wise) approach. You configure each parameter to accumulate gradients, run your 4 forward/backward passes, call Trainer.step(), and then manually zero the gradients.
As per https://discuss.mxnet.io/t/aggregate-gradients-manually-over-n-batches/504/2
"This is very straightforward to do with Gluon. You need to set the grad_req in your network Parameter instances to 'add' and manually set the gradient to zero using zero_grad() after each Trainer.step() (see here). To set grad_req to 'add':
```python
for p in net.collect_params().values():
    p.grad_req = 'add'
```
"And similarly call zero_grad() on each parameter after calling Trainer.step(). Remember to modify batch_size argument of trainer.step() accordingly."
Vishaal
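
Putting those pieces together, here is a minimal sketch of a training loop that accumulates gradients over 4 small batches before each update. `net`, `trainer`, and `train_data` are placeholders for your own Gluon network, Trainer, and DataLoader, and the loss function is just an example choice:

```python
from mxnet import autograd, gluon

accumulate = 4      # number of small batches to accumulate before each update
batch_size = 32     # samples per small batch

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()  # example loss; use your own

# 1) Make every parameter accumulate gradients instead of overwriting them
for p in net.collect_params().values():
    p.grad_req = 'add'

for i, (data, label) in enumerate(train_data):
    with autograd.record():
        output = net(data)
        loss = loss_fn(output, label)
    loss.backward()  # gradients are added to any previously stored gradients

    if (i + 1) % accumulate == 0:
        # 2) Scale the step by the total number of samples since the last update
        trainer.step(batch_size * accumulate)
        # 3) Manually reset the accumulated gradients
        for p in net.collect_params().values():
            p.zero_grad()
```

Note that `trainer.step()` is passed `batch_size * accumulate` rather than `batch_size`, which is the "modify batch_size accordingly" part of the quoted advice.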