Wouldn't setting the first derivative of the cost function J to 0 give the exact Theta values that minimize the cost?

Question

I am currently doing Andrew Ng's ML course. From my calculus knowledge, the first derivative test of a function gives its critical points, if there are any. And given the convex nature of the Linear / Logistic Regression cost function, it is a given that there will be a global / local optimum. If that is the case, rather than taking the long route of one minuscule baby step at a time to reach the global minimum, why don't we use the first derivative test to get the values of Theta that minimize the cost function J in a single attempt, and have a happy ending?

That being said, I do know that there is an alternative to Gradient Descent called the Normal Equation that does just that in one step.

On second thought, I wonder if it is mainly because of the multiple unknown variables involved in the equation (which is why partial derivatives come into play?).

Because there is no closed-form solution to it and/or using the Normal Equation is computationally very expensive with a lot of data. — ilanman, Dec 21 '16 at 14:02

score 0 · Answer 1 · answered Feb 10 '17 at 02:38

Let's take an example:

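Gradient of the cost function for simple linear regression:

$$\nabla \mathrm{RSS}(w) = \nabla\left[(y - Hw)^T (y - Hw)\right] = -2 H^T (y - Hw)$$

y  : output vector
H  : feature matrix (one row per example)
w  : weights
RSS: residual sum of squares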

Setting this gradient to 0 gives the closed-form solution:

$$w = (H^T H)^{-1} H^T y$$
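As a minimal sketch of this closed-form solution, the NumPy snippet below solves for w on a small synthetic problem and checks that the gradient of RSS is essentially zero there. The data, the sizes `N` and `D`, and the name `w_true` are assumptions made up for illustration. It solves the linear system `H^T H w = H^T y` with `np.linalg.solve` instead of forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np

# Hypothetical synthetic data: N examples, D features (illustrative values only)
rng = np.random.default_rng(0)
N, D = 200, 3
H = rng.normal(size=(N, D))                  # feature matrix
w_true = np.array([2.0, -1.0, 0.5])          # weights used to generate y
y = H @ w_true + 0.01 * rng.normal(size=N)   # outputs with a little noise

# Closed form: solve (H^T H) w = H^T y rather than inverting H^T H
w_closed = np.linalg.solve(H.T @ H, H.T @ y)

# Gradient of RSS at the closed-form solution: -2 H^T (y - H w)
grad = -2.0 * H.T @ (y - H @ w_closed)

print("closed-form weights:", w_closed)
print("max |gradient|:", np.abs(grad).max())
```

With only three features this is instantaneous; the cost concern only shows up when D is large.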

Now assume there are D features. Computing this solution requires inverting the D × D matrix H^T H, which costs around O(D³) (and forming H^T H itself costs O(N·D²) for N examples). If there are a million features, this is computationally infeasible within a reasonable amount of time.

We use gradient descent methods because they reach acceptably accurate solutions in much less time.
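To illustrate the trade-off, here is a rough sketch of batch gradient descent on the same RSS objective, compared against the exact least-squares solution from `np.linalg.lstsq`. The function name `gradient_descent`, the learning rate, the iteration count, and the synthetic data are all assumptions chosen for this example, not tuned values. Each iteration only needs matrix-vector products costing O(N·D), so no D × D matrix is ever formed or inverted.

```python
import numpy as np

def gradient_descent(H, y, lr=0.1, n_iters=500):
    """Minimize RSS(w)/N with plain batch gradient descent (illustrative sketch)."""
    N, D = H.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        # Gradient of RSS(w)/N is -2 H^T (y - H w) / N
        grad = -2.0 * H.T @ (y - H @ w) / N
        w -= lr * grad
    return w

# Hypothetical synthetic data (illustrative values only)
rng = np.random.default_rng(1)
H = rng.normal(size=(500, 4))
y = H @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=500)

w_gd = gradient_descent(H, y)
w_exact, *_ = np.linalg.lstsq(H, y, rcond=None)  # exact least-squares reference

print("gradient descent:", w_gd)
print("exact solution:  ", w_exact)
```

On a well-conditioned problem like this one the two answers agree closely; in practice the learning rate has to be small enough for the iteration to converge, and poorly scaled features can slow it down considerably.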
