I'm studying SGD with momentum and have come across two versions of the update formula.
The first is from a wiki:
dw = a * dw - lr * dL/dw # w: weights; lr: learning rate; dL/dw: derivative of the loss function w.r.t. w
w := w + dw
The second version is more common:
dw = a * dw - (1 - a) * dL/dw
w := w + dw
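To make the comparison concrete, here is a minimal runnable sketch (my own toy setup, not from the wiki: a 1-D quadratic loss L(w) = w**2 with gradient 2*w, and illustrative values for a and lr) that applies both updates exactly as written above:

a = 0.9          # momentum coefficient
lr = 0.01        # learning rate for version 1 (note lr != 1 - a here)
w1 = w2 = 5.0    # same starting weight for both variants
dw1 = dw2 = 0.0  # momentum buffers

for _ in range(200):
    g1, g2 = 2 * w1, 2 * w2       # dL/dw at each variant's current weight
    dw1 = a * dw1 - lr * g1       # version 1: free coefficient lr
    w1 += dw1
    dw2 = a * dw2 - (1 - a) * g2  # version 2: coefficient tied to (1 - a)
    w2 += dw2

print(w1, w2)  # both approach the minimum at w = 0

Running this, both variants converge even though lr != (1 - a), which is exactly what makes me doubt that the coefficient has to be (1 - a).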
My question is: why must the coefficient for the dL/dw term be (1 - a)? It seems to me that this would still make sense even if lr != (1 - a).
Is there any specific reason to choose this coefficient?
I asked ChatGPT, and it told me the first version is incorrect, but it did not provide any reason.