
I learnt gradient descent through online resources (namely, the Machine Learning course on Coursera). However, the information provided only said to repeat gradient descent until it converges.

Their definition of convergence was to plot the cost function against the number of iterations and watch when the curve flattens out. Therefore I assume that I would do the following:

if (change_in_costfunction > precisionvalue) {
          repeat gradient_descent
} 

Alternatively, I was wondering if another way to determine convergence is to watch each coefficient approach its true value:

if (change_in_coefficient_j > precisionvalue) {
          repeat gradient_descent_for_j
} 
...repeat for all coefficients
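For concreteness, here is how I imagine the cost-based check would be assembled into a full loop for one-variable linear regression (a Python sketch of my understanding; the names `precision_value`, `alpha`, and `max_iters` are my own choices, not from the course):

```python
def gradient_descent(xs, ys, alpha=0.1, precision_value=1e-9, max_iters=100000):
    """Fit y = t0 + t1*x by gradient descent, stopping when the cost stops changing."""
    t0 = t1 = 0.0
    m = len(ys)
    cost = lambda: sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    prev_cost = cost()
    for _ in range(max_iters):
        # Gradient of the squared-error cost, with a simultaneous update of both coefficients
        g0 = sum(t0 + t1 * x - y for x, y in zip(xs, ys)) / m
        g1 = sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
        new_cost = cost()
        # Convergence check: stop once the cost changes by less than precision_value
        if abs(prev_cost - new_cost) < precision_value:
            break
        prev_cost = new_cost
    return t0, t1
```

On clean data generated from y = 2x + 1 this recovers the coefficients to within a few thousandths.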

So is convergence based on the cost function or the coefficients? And how do we determine the precision value? Should it be a % of the coefficient or total cost function?

Terence Chow

2 Answers


You can picture how Gradient Descent (GD) works by imagining that you throw a marble into a bowl and start taking photos. The marble will oscillate until friction stops it at the bottom. Now imagine an environment where friction is so small that the marble takes a very long time to stop completely; we can then assume that once the oscillations are small enough, the marble has reached the bottom (although it could keep oscillating). In the following image you can see the first eight steps (photos of the marble) of GD.

[image: first eight iterations of GD oscillating toward the minimum]

If we continue taking photos, the marble makes no appreciable movement; you have to zoom in to see it:

[image: zoomed-in view of the later, much smaller iterations]

We could keep taking photos, and the movements would become ever more negligible.

So reaching a point at which GD makes only very small changes in your objective function is called convergence. That doesn't mean it has reached the optimal result, but it is very near it, if not on it.

The precision value can be chosen as a threshold below which consecutive iterations of GD are considered almost the same:

grad(i)   = 0.0001
grad(i+1) = 0.000099989  <-- grad has changed by about 0.01% => STOP
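That relative-change check could be written, for example, like this (a sketch; `rel_tol` is my own name for the threshold, here 0.012% so the values above count as converged):

```python
def converged(prev, curr, rel_tol=1.2e-4):
    """True when curr differs from prev by less than rel_tol, relative to prev."""
    if prev == 0:
        # Avoid division by zero: only "converged" if both values are exactly zero
        return curr == 0
    return abs(curr - prev) / abs(prev) < rel_tol
```

The same helper works whether you feed it consecutive cost values, gradient norms, or individual coefficients.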
jabaldonedo
  • I'm accepting your answer, but you didn't make it clear if GD is of the cost function or the coefficient. The comment by Thomas Jungblut says it is convergence of coefficients which will reflect in the cost function so it sounds to me like 'it doesn't matter'...Thanks for the detailed answer though! – Terence Chow Jun 25 '13 at 13:44
  • GD is a general algorithm for finding the minimum of a convex function. That function can be the cost function of a ML problem or any other function. – jabaldonedo Jun 25 '13 at 15:26
  • I also have some confusion about this, and still cannot find a clear answer, as this step (checking for convergence) is skipped in all the articles I have found so far. We can compute the cost function at each step to see how much it changes from step to step, but calculating the cost function may be expensive too. In stochastic gradient descent we can use part of the data to compute the descent step itself, but do we still need all the data to calculate the cost function? It's still unclear to me. – Vadim May 04 '16 at 05:48

I think I understand your question. Based on my understanding, GD is based on the cost function: it iterates until the cost function converges.

Imagine plotting the cost function (y-axis) against the number of GD iterations (x-axis). If GD is working properly, the curve is decreasing and concave up (similar to 1/x). Since the curve is decreasing, the decrease in the cost function becomes smaller and smaller, and eventually there comes a point where the curve is almost flat. Around that point we say GD has more or less converged (that is, where the cost function decreases by less than the precision_value per iteration).

So, I would say your first approach is what you need:

if (change_in_costFunction > precision_value)
    repeat GD;
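As a complete loop, that criterion might look like this on a toy objective (a sketch; the quadratic cost, `alpha`, and `precision_value` are invented for illustration):

```python
def minimize(alpha=0.1, precision_value=1e-9):
    """Minimize cost(w) = (w - 3)^2 by GD, stopping when the cost stops decreasing."""
    cost = lambda w: (w - 3) ** 2   # toy cost function, minimized at w = 3
    w = 0.0
    change_in_costFunction = float('inf')
    while change_in_costFunction > precision_value:
        prev = cost(w)
        w -= alpha * 2 * (w - 3)    # one GD step: the gradient is 2*(w - 3)
        change_in_costFunction = prev - cost(w)
    return w
```

Each pass shrinks the cost by a constant factor here, so `change_in_costFunction` eventually drops below `precision_value` and the loop stops with `w` very close to 3.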