I have just started studying neural networks and I managed to figure out how to derive the equations necessary for backpropagation. I've spent nearly 3 days asking all of my professors and googling everything I can find. My math skills are admittedly poor, but I really want to understand how this particular formula mathematically makes sense. The formula is used to update the weight after the gradient has already been found.
W1 = W0 - L * (dC/dW)
Where:
W1 = new weight
W0 = old weight
L = learning rate
dC/dW = the partial derivative of the cost function with respect to this weight, i.e. one component of the gradient vector of the cost function
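To make sure I'm reading the formula correctly, here is how I picture a single update step in code. The weight, gradient value, and learning rate below are made-up numbers, purely for illustration:

```python
# One gradient-descent update step, W1 = W0 - L * (dC/dW).
# All numbers here are invented for illustration; in a real network they
# come from the current weights, backpropagation, and the training setup.
w0 = 0.75        # old weight W0
grad = 2.4       # dC/dW evaluated at w0 (one component of the gradient)
lr = 0.01        # learning rate L

w1 = w0 - lr * grad   # new weight W1
print(w1)             # 0.726 -- the weight moved a small step "downhill"
```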
What I know so far:
- The gradient is the vector of a function's partial derivatives, and the gradient itself points in the direction of maximum rate of increase. Each partial derivative gives the rate of change along the direction of the variable it is taken with respect to.
- dC/dW is one of these partial derivatives.
- dC/dW evaluates to a rate of change. Its sign tells us the direction of change, and the value itself is the ratio between the change in cost and the change in weight at a particular weight.
- Somehow, multiplying dC/dW by the learning rate takes only a small portion of this rate as the change in weight (I tried to check this numerically in the sketch after this list).
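To test my reading of the points above, I tried a tiny numeric sketch with a made-up one-dimensional cost, C(w) = (w - 3)^2. The function and all the numbers are invented just for this example, not taken from a real network:

```python
# Toy check: compare the analytic derivative of C(w) = (w - 3)**2 with a
# finite-difference estimate, then apply one weight update with it.
def cost(w):
    return (w - 3.0) ** 2

w = 0.5
eps = 1e-6

# dC/dW as a rate: (change in cost) / (change in weight) around w
grad_numeric = (cost(w + eps) - cost(w)) / eps   # ~ -5.0
grad_analytic = 2 * (w - 3.0)                    # exactly -5.0

lr = 0.1
step = lr * grad_numeric   # the actual amount the weight changes by
w_new = w - step           # w moves from 0.5 toward 3.0 (the minimum)
print(grad_numeric, grad_analytic, w_new)
```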
What I can't reconcile:
- The learning rate is just a unitless scalar. How is it possible to multiply a scalar by a rate and end up with a measurable change in the weight? What am I failing to understand here?