I recently watched Andrew Ng's video on SGD with momentum (SGDM). I understand that the momentum update computes V_dw as a weighted combination of its previous value and the current gradient dW. What I don't understand is why momentum is also called an exponentially weighted average. Also, at 6:37 in the video Ng says that using beta = 0.9 effectively means averaging over the last 10 gradients. Can someone explain how that works? To me, it looks like just a scalar weight of 1 - 0.9 = 0.1 applied to every gradient in the vector dW.
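To make the question concrete, here is a minimal sketch of the update as I understand it, assuming the form v = beta * v + (1 - beta) * dW with v starting at 0 and scalar gradients. The function names and the sample gradient values are made up for illustration. It compares running the recursion step by step with an explicit weighted sum over all past gradients, so the weight each gradient ends up with is visible.

```python
def ewa_recursive(grads, beta=0.9):
    """Apply v = beta * v + (1 - beta) * g once per gradient, starting from v = 0."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v


def ewa_weights(n, beta=0.9):
    """Weight the final v places on each of n gradients, oldest first.

    Unrolling the recursion gives (1 - beta) * beta**k for the gradient
    that is k steps old, so the newest gradient gets (1 - beta).
    """
    return [(1 - beta) * beta ** (n - 1 - t) for t in range(n)]


# Hypothetical gradient history, chosen only for illustration.
grads = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 2.5, -1.5]
beta = 0.9

weights = ewa_weights(len(grads), beta)
v_unrolled = sum(w * g for w, g in zip(weights, grads))
v_recursive = ewa_recursive(grads, beta)

print("recursive:", v_recursive)
print("unrolled: ", v_unrolled)
print("weight on newest gradient:      ", weights[-1])
print("weight on gradient 10 steps back:", weights[-11])
```

If this sketch is right, the weights decay geometrically (exponentially) with the age of the gradient, which may be where the name comes from. With beta = 0.9, a gradient from 10 steps back carries 0.9^10, about 0.35, of the weight of the newest one, which seems related to the 1 / (1 - beta) = 10 rule of thumb from the video.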
Appreciate any insight! I feel like I'm missing something fundamental.