
I tried to use gradient accumulation in my project. To my understanding, gradient accumulation should give the same result as increasing the batch size by the same factor. I compared batch_size==32 against batch_size==8 with gradient_accumulation==4 in my project; however, the results differ even when I disable shuffling in the dataloader. The batch_size==8, accumulation==4 variant performs significantly worse.

I wonder why?

Here is my snippet:

loss = model(x)
epoch_loss += float(loss)

loss.backward()

# step starts from 1
if (step % accumulate_step == 0) or (step == len(dataloader)):

    if clip_grad_norm > 0:
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_grad_norm)

    optimizer.step()
    if scheduler:
        scheduler.step()

    optimizer.zero_grad()
namespace-Pt

1 Answer


Assuming your loss is mean-reduced, you need to scale the loss by 1/accumulate_step.

The default behavior of most loss functions is to return the average loss value across the batch elements. This is referred to as mean-reduction, and it has the property that batch size does not affect the magnitude of the loss (or the magnitude of its gradients). However, when implementing gradient accumulation, each call to backward adds to the gradients already stored. Therefore, if you call backward four times on quarter-sized batches, you end up with gradients that are four times larger than if you had called backward once on a full-sized batch. To account for this you need to divide the gradients by accumulate_step, which can be accomplished by scaling the loss by 1/accumulate_step before back-propagation.

loss = model(x) / accumulate_step  # scale so accumulated gradients match a full-sized batch

loss.backward()

# step starts from 1
if (step % accumulate_step == 0) or (step == len(dataloader)):

    if clip_grad_norm > 0:
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_grad_norm)

    optimizer.step()
    if scheduler:
        scheduler.step()

    optimizer.zero_grad()
jodag
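As a sanity check (not part of the original answer), here is a minimal sketch using a hypothetical nn.Linear / nn.MSELoss setup: with a mean-reduced loss, the 1/accumulate_step scaling, and no batch-size-dependent layers, the accumulated gradients match the full-batch gradients up to floating-point error.

import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(32, 10), torch.randn(32, 1)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # mean-reduced by default

# single backward pass on the full batch of 32
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# four accumulated backward passes on batches of 8, each loss scaled by 1/4
model.zero_grad()
for xb, yb in zip(x.split(8), y.split(8)):
    (criterion(model(xb), yb) / 4).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True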
  • Thank you. I use cross-entropy, which is mean-reduced. I tried to scale it by 4, but the gap remains. Any possible reasons? – namespace-Pt Sep 25 '22 at 15:56
  • You will need to post a minimal but working example that demonstrates the issue. There are many reasons why performance could change. Some layers do things that are dependent on batch size (like batch normalization), and those side effects are not accounted for by simple gradient accumulation. – jodag Sep 25 '22 at 15:58
  • I'm just running a BERT. I'm double-checking my code. – namespace-Pt Sep 25 '22 at 16:10
  • It's because of the `Dropout` layer. Even though the random seeds for the two runs are the same, with `batch_size==32` the model goes through the dropout layer in one pass, but with `batch_size==8, accumulation==4` the model goes through the dropout layer four times. – namespace-Pt Sep 25 '22 at 17:31
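To illustrate the dropout point, here is a hedged sketch (not from the original thread): a toy "model" with two dropout calls stands in for BERT's many dropout layers. With an identical seed, one pass over 32 samples and four passes over 8 samples consume the random number generator in a different order, so the dropout masks (and therefore the gradients) no longer match.

import torch
import torch.nn.functional as F

def forward(batch):
    # toy stand-in for a network with multiple dropout layers
    h = F.dropout(batch, p=0.5, training=True)
    return F.dropout(h, p=0.5, training=True)

x = torch.ones(32, 4)

torch.manual_seed(0)
full = forward(x)                                      # one pass, batch of 32

torch.manual_seed(0)
chunks = torch.cat([forward(c) for c in x.split(8)])   # four passes, batches of 8

print(torch.equal(full, chunks))  # almost certainly False: the masks differ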