
I'm studying SGD with momentum and have come across two versions of the update formula.

The first is from a wiki:

dw = a * dw - lr * dL/dw # w: weights; lr: learning rate; dL/dw: derivative of the loss with respect to w
w := w + dw

The second version is more common:

dw = a * dw - (1 - a) * dL/dw
w := w + dw

My question is: why must the coefficient of the dL/dw term be (1 - a)? It seems to me that even with lr != (1 - a), the update would still make sense.
Is there any specific reason to choose this coefficient?

I asked ChatGPT and it told me the first version is not correct, but it did not provide any reason.


1 Answer


You should think of the gradients dL/dw as a sequence of numbers. The idea of momentum is to keep track of a moving average of this sequence and to step in the direction of that average rather than the raw gradient, which can be noisy since it is calculated from a single example or a small batch.

The formula dw = a * dw + (1 - a) * dL/dw is a convenient way to compute an exponential moving average (EMA) of the gradients at each step. To answer your question: with the coefficient (1 - a), the weights placed on the past gradients sum to 1; with any other coefficient they do not, so the result cannot sensibly be described as an average.

  • A simple demonstration is to check what happens when dL/dw is a constant sequence (see the sketch below):
    • with (1 - a), dw converges to exactly dL/dw
    • with any other coefficient c, dw converges to (c / (1 - a)) * dL/dw, a rescaled value rather than an average
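
A minimal sketch of that check in Python (a = 0.9, a constant gradient g = 1.0, and the alternative coefficients 0.05 and 0.5 are arbitrary picks for illustration):

a = 0.9
g = 1.0  # constant gradient dL/dw

for c in [1 - a, 0.05, 0.5]:  # (1 - a) first, then two arbitrary alternatives
    dw = 0.0
    for _ in range(500):  # iterate dw = a * dw + c * g to its steady state
        dw = a * dw + c * g
    print(f"c = {c:.2f}: dw -> {dw:.4f}  (c / (1 - a) * g = {c / (1 - a) * g:.4f})")

Only c = (1 - a) reproduces the gradient itself; every other choice rescales it by c / (1 - a).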

If you want to alter the learning rate, you should use:

dw = a * dw + (1 - a) * dL/dw
w := w - lr * dw
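
For concreteness, here is a toy run of this decoupled form on the 1-D quadratic L(w) = w^2, where dL/dw = 2 * w (the values of a, lr, and the starting point are arbitrary choices for illustration):

a, lr = 0.9, 0.1
w, dw = 5.0, 0.0  # start away from the minimum at w = 0

for _ in range(300):
    grad = 2 * w  # dL/dw for L(w) = w**2
    dw = a * dw + (1 - a) * grad  # EMA of the gradients
    w = w - lr * dw  # learning rate applied separately

print(f"w after 300 steps: {w:.6f}")  # close to 0, the minimum of L

Keeping lr out of the average lets you tune the step size without distorting the EMA itself.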
  • Thank you for your response! Viewing `dw` as an exponential moving average (EMA) offers an insightful perspective. From what I can gather, a conventional EMA takes the form `dw = a * dw + (1 - a) * dL/dw`, not `dw = a * dw - (1 - a) * dL/dw`. While I agree that with `lr < (1 - a)` the resulting `dw` is scaled down, that seems reasonable for various other learning rate selections as well. And I believe that with `lr > (1 - a)` divergence is possible (though not guaranteed), yet this issue is akin to overshooting. – W Lewis Aug 31 '23 at 01:50
  • As this question is closed, I have reopened it on [StackExchange](https://math.stackexchange.com/questions/4761231/coefficient-for-the-gradient-term-in-stochastic-gradient-descent-sgd-with-mome) – W Lewis Aug 31 '23 at 01:54
  • Oh right, yes, definitely with a plus `+`, not a `-`. (I copied it from your question without thinking.) Can you link to where you have seen it with a `-`? – Ronald Aug 31 '23 at 07:48
  • Actually, the formulation should be `dw = a * dw + (1 - a) * dL/dw` and `w := w - dw`. I made a small modification to facilitate a clear comparison between the two versions. – W Lewis Aug 31 '23 at 08:39