3

In a machine learning regression problem, why is the local minimum computed using the derivative of the function instead of the actual function?

Example: http://en.wikipedia.org/wiki/Gradient_descent

The gradient descent algorithm is applied to find a local minimum of the function

$$f(x) = x^4 - 3x^3 + 2, \tag{A}$$

with derivative

$$f'(x) = 4x^3 - 9x^2. \tag{B}$$

Here, to find the local minimum of function (A) using the gradient descent algorithm, they use the derivative of (A), which is function (B).
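
For concreteness, here is a minimal Python sketch of that iteration, stepping against the derivative (B); the starting point and step size below are arbitrary illustrative choices, not values from the question:

```python
# Gradient descent on f(x) = x^4 - 3x^3 + 2 using its derivative f'(x) = 4x^3 - 9x^2.
def f(x):
    return x**4 - 3*x**3 + 2

def f_prime(x):
    return 4*x**3 - 9*x**2

x = 4.0       # starting point (arbitrary choice)
gamma = 0.01  # step size / learning rate (arbitrary choice)

for _ in range(10000):
    x = x - gamma * f_prime(x)  # move against the slope

print(x, f(x))  # x approaches 9/4 = 2.25, the local minimum
```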

user703555
  • 265
  • 1
  • 7
  • 14

2 Answers

3

The reason is that when the function is convex (or concave if you're doing maximisation---these problems are equivalent), you know that there is a single minimum (maximum). This means that there is a single point where the gradient is equal to zero. There are techniques that use only the function values, but if you can compute the gradient you can converge much faster, because the gradient tells you how far (and in which direction) you are from the optimum.
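
To see the convexity point concretely, here is a toy Python sketch; the convex example g(x) = (x - 2)^2 is my own illustration, not from the question (whose quartic is not convex everywhere). Gradient descent reaches the same stationary point from any starting value:

```python
# For a convex function such as g(x) = (x - 2)^2 there is exactly one point where
# the gradient is zero, so gradient descent finds it from any starting point.
def g_prime(x):
    return 2 * (x - 2)

for x0 in (-10.0, 0.0, 7.5):
    x = x0
    for _ in range(1000):
        x = x - 0.1 * g_prime(x)
    print(x0, "->", round(x, 6))  # every starting point ends up at x = 2
```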

As well as gradient descent, there's an optimisation method known as Newton's method, which requires computation of the second derivative (the Hessian, in multivariate optimisation). This converges even faster, but requires you to invert the Hessian, which is not feasible if you have a lot of parameters. There are methods that get around this by computing a limited-memory approximation of the Hessian. These methods still converge faster than plain gradient descent because they use information about the curvature of the function: it's a simple tradeoff, where the more you know about the function you're trying to optimise, the faster you can find the solution.
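
In one dimension the "Hessian" is just the second derivative, here f''(x) = 12x^2 - 18x, and inverting it is a scalar division. A minimal sketch of Newton's method on the question's example (the starting point is an arbitrary choice):

```python
# Newton's method for minimising f(x) = x^4 - 3x^3 + 2:
# iterate x <- x - f'(x) / f''(x), using curvature (second derivative) as well as slope.
def f_prime(x):
    return 4*x**3 - 9*x**2

def f_double_prime(x):
    return 12*x**2 - 18*x

x = 4.0  # starting point (arbitrary choice; f''(x) stays positive along the path from here)
for _ in range(20):
    x = x - f_prime(x) / f_double_prime(x)

print(x)  # converges to 2.25 in far fewer iterations than plain gradient descent
```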

Ben Allison
  • 7,244
  • 1
  • 15
  • 24
  • 1
    Sorry, I have a very silly question. For f'(x)=4x^3−9x^2, just by looking at the function we can find the local minimum, i.e. f'(x) = 0 occurs for x=0, so why do we need gradient descent? – user703555 Feb 15 '13 at 09:16
  • Well, it's not quite just by looking at it: you can solve for x when f'(x) = 0 (see the worked solve after these comments). This is a simple example: most of the time when you use optimisation methods, you can't get an analytic solution. Take a look at http://en.wikipedia.org/wiki/Convex_optimization for more details. – Ben Allison Feb 15 '13 at 10:09
  • @BenAllison So it is because of computational cost? I mean, if we have a loss function f(x), we want to find the x that makes it minimal. But this means we would have to check all possible values of x. Instead we compute the gradient at some random initial x and move towards the minimum (according to the gradient), saving computational time? – Anton Nov 27 '21 at 13:05
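
For reference, the analytic solve mentioned in the comments works out as follows for this example (note that x = 0 is a stationary point but not the local minimum):

$$f'(x) = 4x^3 - 9x^2 = x^2(4x - 9) = 0 \;\Rightarrow\; x = 0 \ \text{or}\ x = \tfrac{9}{4}, \qquad f''\!\left(\tfrac{9}{4}\right) = \tfrac{81}{4} > 0,$$

so the local minimum is at x = 9/4 = 2.25, while x = 0 is an inflection point (f' does not change sign there).
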
2

I'm not a mathematician, so I can't give you an exact answer; however, you need to understand what differentiation does, e.g.:

http://en.wikipedia.org/wiki/Derivative http://en.wikipedia.org/wiki/Differential_of_a_function

This is what you need (what differentiation does): http://en.wikipedia.org/wiki/File:Graph_of_sliding_derivative_line.gif

The derivative at a point equals the slope of the tangent line to the graph of the function at that point, and this is exactly what you want when you are looking for a descent direction. Take this as a very informal point of view; the Wikipedia articles will give you much deeper and more precise knowledge.
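
As a quick numerical check of that picture (a sketch reusing the question's f; the step size is an arbitrary choice), the analytic derivative matches the slope of a short secant line, and stepping against that slope lowers f:

```python
# The derivative at a point equals the slope of the tangent line there,
# which the slope of a short secant line approximates.
def f(x):
    return x**4 - 3*x**3 + 2

def f_prime(x):
    return 4*x**3 - 9*x**2

x, h = 1.0, 1e-5
secant_slope = (f(x + h) - f(x - h)) / (2 * h)
print(f_prime(x), secant_slope)             # both are about -5 at x = 1

step = 0.1
print(f(x), ">", f(x - step * f_prime(x)))  # stepping against the slope lowers f
```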

xhudik
  • 2,414
  • 1
  • 21
  • 39