
I'm studying SGD with momentum and have come across two versions of the update formula.

The first is from a wiki:

dw = a * dw - lr * dL/dw # w: weights; lr: learning rate; dL/dw: derivative of the loss with respect to w
w := w + dw

The second version is more common:

dw = a * dw - (1 - a) * dL/dw
w := w + dw

My question is: why must the coefficient of the dL/dw term be (1 - a)? It seems to me that even with lr != (1 - a), the update would still make sense.
Is there any specific reason to choose this coefficient?

I asked ChatGPT and it told me the first version is not correct, but it did not provide any reason.


1 Answer


You should think of the gradients dL/dw as a sequence of numbers. The idea of momentum is to keep track of a moving average of this sequence and to step in the direction of that average rather than the raw gradient, which can be noisy since it is calculated from a single example or a small batch.

The formula dw = a * dw + (1 - a) * dL/dw is a convenient way to compute an exponential moving average (EMA) of the gradients at each step. To answer your question: with the coefficient (1 - a), the weights placed on the past gradients sum to 1; with any other coefficient they do not, so the result cannot sensibly be described as an average.

  • A simple demonstration is to check what happens when dL/dw is a constant sequence (see the sketch below):
    • with (1 - a), dw converges to exactly dL/dw
    • with any other coefficient c, dw converges to (c / (1 - a)) * dL/dw, a rescaled value rather than an average
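
A minimal sketch of that check in Python (a = 0.9, a constant gradient g = 1.0, and the alternative coefficients 0.05 and 0.5 are arbitrary picks for illustration):

a = 0.9
g = 1.0  # constant gradient dL/dw

for c in [1 - a, 0.05, 0.5]:  # (1 - a) first, then two arbitrary alternatives
    dw = 0.0
    for _ in range(500):  # iterate dw = a * dw + c * g to its steady state
        dw = a * dw + c * g
    print(f"c = {c:.2f}: dw -> {dw:.4f}  (c / (1 - a) * g = {c / (1 - a) * g:.4f})")

Only c = (1 - a) reproduces the gradient itself; every other choice rescales it by c / (1 - a).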

If you want to alter the learning rate, you should use:

dw = a * dw + (1 - a) * dL/dw
w := w - lr * dw
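
For concreteness, here is a toy run of this decoupled form on the 1-D quadratic L(w) = w^2, where dL/dw = 2 * w (the values of a, lr, and the starting point are arbitrary choices for illustration):

a, lr = 0.9, 0.1
w, dw = 5.0, 0.0  # start away from the minimum at w = 0

for _ in range(300):
    grad = 2 * w  # dL/dw for L(w) = w**2
    dw = a * dw + (1 - a) * grad  # EMA of the gradients
    w = w - lr * dw  # learning rate applied separately

print(f"w after 300 steps: {w:.6f}")  # close to 0, the minimum of L

Keeping lr out of the average lets you tune the step size without distorting the EMA itself.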
  • Thank you for your response! Viewing `dw` as an exponential moving average (EMA) offers an insightful perspective. From what I can gather, a conventional EMA takes the form `dw = a * dw + (1 - a) * dL/dw`, not `dw = a * dw - (1 - a) * dL/dw`. While I agree that with `lr < (1 - a)` the resulting `dw` is scaled down, that seems reasonable for various other learning rate selections as well. And I believe that with `lr > (1 - a)` divergence is possible (though not guaranteed), yet this issue is akin to overshooting. – W Lewis Aug 31 '23 at 01:50
  • As this question is closed, I have reopened it on [StackExchange](https://math.stackexchange.com/questions/4761231/coefficient-for-the-gradient-term-in-stochastic-gradient-descent-sgd-with-mome) – W Lewis Aug 31 '23 at 01:54
  • Oh right, yes, definitely with a plus `+`, not a `-`. (I copied it from your question without thinking.) Can you link to where you have seen it with a `-`? – Ronald Aug 31 '23 at 07:48
  • Actually, the formulation should be `dw = a * dw + (1 - a) * dL/dw` and `w := w - dw`. I made a small modification to facilitate a clear comparison between the two versions. – W Lewis Aug 31 '23 at 08:39