I'm running 2 gradient descent iterations (initial conditions: learning_rate = 0.1 and [w0, w1] = [0, 0]) to find the 2 parameters (y_hat = w0 + w1*x) of a linear model that fits a simple dataset, x = [0, 1, 2, 3, 4] and y = [0, 2, 3, 8, 17]. Using the closed-form formula, I found that w0 = -2 and w1 = 4. For the first 2 iterations of gradient descent, I get w0 = 0.6 and then w0 = 0.74. However, I thought that, if no overshooting occurs, w0 should decrease on every single iteration, given the initial condition of gradient descent and the answer from the closed-form solution. Why does this happen if the error function is a convex function?
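For reference, here is a minimal sketch of the update I am running (assuming the usual mean-squared-error cost J = (1/(2m)) * sum((y_hat - y)^2), which reproduces the numbers above):

```python
import numpy as np

# Data from the question
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([0, 2, 3, 8, 17], dtype=float)
m = len(x)

w0, w1 = 0.0, 0.0   # initial weights
lr = 0.1            # learning rate

for i in range(2):
    y_hat = w0 + w1 * x
    # Gradients of J = (1/(2m)) * sum((y_hat - y)**2)
    grad_w0 = (y_hat - y).sum() / m
    grad_w1 = ((y_hat - y) * x).sum() / m
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1
    print(f"iter {i + 1}: w0 = {w0:.2f}, w1 = {w1:.2f}")
# iter 1: w0 = 0.60, w1 = 2.00
# iter 2: w0 = 0.74, w1 = 2.68
```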
-
I suggest making a test where the initial starting parameters are -2.1 and 4.1; that should make solving the problem easier. Starting parameters that close to the correct solution should converge very quickly. – James Phillips Feb 15 '19 at 11:41
1 Answer
You are actually misinterpreting gradient descent. Gradient descent does not guarantee that on every iteration each weight moves towards its own optimal value; what it guarantees is that you always move towards lower cost, provided sensible hyperparameters are supplied. In your case, where you initialize the weights at [0,0], there is no value of alpha that will prevent this effect, because at [0,0] the gradient component for w0 is negative, so the first update pushes w0 in the positive direction no matter what the learning rate is. When w0 goes from 0.6 to 0.74 (consider this one component) and w1 moves from 2 to 2.68 (consider this the other component), the resulting update vector points down the hill in the direction of steepest descent, and that is what GD is about: the weights collectively move down the hill of the cost function.
You can verify this by plotting the cost graph. Also, after the second iteration the w0 value does start moving towards -2, because from that point on the steepest direction has a negative w0 component.
The graph below shows the value of w0 at each iteration (x_axis = w0, y_axis = iteration_no).
Now you can clearly see the little upward notch at the beginning, which is exactly what you observed.
Below is the graph of the cost at each iteration.
From this cost curve you can clearly see that the cost decreases on every single iteration, i.e. we are continuously moving down the hill in the steepest direction, and that is the actual job of GD. And yes, we may see behaviour where some weights temporarily move away from their required values while the model is learning, but because we are moving down the hill we still converge to the minimum and to the correct values of the weights.
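You can reproduce both curves with a short script. The following is just a sketch, assuming the same MSE cost and the learning rate of 0.1 from the question (it prints the values; plotting w0_hist and cost_hist with matplotlib gives the graphs above):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([0, 2, 3, 8, 17], dtype=float)
m = len(x)

def run_gd(w0, w1, lr=0.1, iters=50):
    """Plain gradient descent on the MSE cost; records w0 and the cost per iteration."""
    w0_hist, cost_hist = [], []
    for _ in range(iters):
        y_hat = w0 + w1 * x
        w0_hist.append(w0)
        cost_hist.append(((y_hat - y) ** 2).sum() / (2 * m))
        w0 -= lr * (y_hat - y).sum() / m
        w1 -= lr * ((y_hat - y) * x).sum() / m
    return w0_hist, cost_hist

w0_hist, cost_hist = run_gd(0.0, 0.0)
for i in range(5):
    print(f"iter {i}: w0 = {w0_hist[i]:.3f}, cost = {cost_hist[i]:.3f}")
# w0 rises from 0.0 to about 0.74 before turning back towards -2,
# while the cost decreases on every single iteration.
```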
Now, if this behaviour still bothers you, the only way to avoid it is to change the initial values of the weights, because tuning the learning_rate will not remove it with the [0,0] initialization.
So initialize with [-0.1, 3.1] and keep the same learning_rate.
Now you can clearly see that there is no such upward notch at the beginning, because the cost decreases in a direction in which the weights also move towards their optimal values, i.e. [-2, 4].
You can also see that the cost and w0 reach the required values in fewer iterations than before; this is because we initialized very close to the required values.
And there are many more such initializations that give you this result.
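As a quick check of the two initializations (again just a sketch under the same assumptions as above, with a hypothetical helper function first_w0_values):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([0, 2, 3, 8, 17], dtype=float)
m = len(x)

def first_w0_values(w0, w1, lr=0.1, iters=5):
    """Return w0 after each of the first few gradient-descent iterations."""
    out = []
    for _ in range(iters):
        y_hat = w0 + w1 * x
        w0, w1 = w0 - lr * (y_hat - y).sum() / m, w1 - lr * ((y_hat - y) * x).sum() / m
        out.append(round(w0, 3))
    return out

print(first_w0_values(0.0, 0.0))    # [0.6, 0.74, 0.73, ...]  w0 first moves away from -2
print(first_w0_values(-0.1, 3.1))   # [-0.11, -0.151, ...]    w0 heads towards -2 from the start
```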
Conclusion - GD always moves along the steepest path down the hill.
Happy Machine Learning...

-
Thanks for your explanation. What really bothers me is that, given that the error function for the linear regression model is convex and we start at w0 = 0 and end at w0 = -2, shouldn't the 2D graph of w0 vs. the error function be a convex parabola centred at -2? If that's the case, w0 = 0 lies to the right of -2, where the slope is always positive. If we start at 0 and the slope is always positive, then we should always move in the opposite direction of the slope, so w0 should decrease. Or is the w0 vs. error function graph actually not a perfect convex parabola? – kastle Feb 15 '19 at 16:10
-
A convex function has a global minimum and no other local optima, so if you plot the 3D surface of the error over w0 and w1, you will notice that at (0,0), although the minimum lies towards the negative side of w0, the steepest direction points a bit towards the positive side, and GD follows the steepest direction; the plot actually looks very much like a valley near (0,0). You will get a better intuition for what I am saying once you plot the 3D surface and zoom in near (0,0). – Gaurav Sharma Feb 16 '19 at 13:40
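To make the "steepest direction" point concrete, here is a small sketch (assuming the same MSE cost as above) that evaluates the gradient at [0, 0]; the steepest-descent direction there has a positive w0 component even though the minimum sits at w0 = -2:

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([0, 2, 3, 8, 17], dtype=float)
m = len(x)

w0, w1 = 0.0, 0.0
y_hat = w0 + w1 * x

# Gradient of J = (1/(2m)) * sum((y_hat - y)**2) at [0, 0]
grad = np.array([(y_hat - y).sum() / m, ((y_hat - y) * x).sum() / m])
print(grad)  # [ -6. -20.]

# Steepest-descent direction = negative gradient, normalised
direction = -grad / np.linalg.norm(grad)
print(direction)  # ~[0.287, 0.958]: mostly along w1, and slightly towards *positive* w0
```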
-
I understand that plotting a 3D curve of w0, w1 and the cost function would definitely help clear up my confusion, but I got stuck plotting such a graph with Python and matplotlib after reading some tutorials. Is there a tutorial you would recommend, or some way other than Python to obtain the graph? Thanks. – kastle Feb 21 '19 at 04:46
-
Refer to https://stackoverflow.com/questions/28542686/3d-plot-of-the-error-function-in-a-linear-regression ; the code in the answer just needs to be copied and pasted even if you can't follow all of it, but it is very neat and simple code to understand. – Gaurav Sharma Feb 24 '19 at 12:19
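For completeness, here is a minimal matplotlib sketch (along the lines of the linked answer, not its exact code) that plots the cost surface over w0 and w1 for this dataset and marks the start [0, 0] and the optimum [-2, 4]:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3D projection on older matplotlib)

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([0, 2, 3, 8, 17], dtype=float)
m = len(x)

# Grid of (w0, w1) values covering the start [0, 0] and the optimum [-2, 4]
w0_vals = np.linspace(-6, 6, 200)
w1_vals = np.linspace(-2, 8, 200)
W0, W1 = np.meshgrid(w0_vals, w1_vals)

# Cost J(w0, w1) = (1/(2m)) * sum_i (w0 + w1*x_i - y_i)^2, evaluated on the grid
J = sum((W0 + W1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

fig = plt.figure(figsize=(10, 4))

# 3D surface of the cost
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(W0, W1, J, cmap="viridis", alpha=0.8)
ax1.set_xlabel("w0"); ax1.set_ylabel("w1"); ax1.set_zlabel("cost")

# Contour view with the start and the optimum marked
ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(W0, W1, J, levels=30)
ax2.plot(0, 0, "ro", label="start [0, 0]")
ax2.plot(-2, 4, "g*", markersize=12, label="optimum [-2, 4]")
ax2.set_xlabel("w0"); ax2.set_ylabel("w1"); ax2.legend()

plt.tight_layout()
plt.show()
```

Zooming in near (0, 0) on the contour plot shows the valley shape described in the earlier comment.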