
I'm trying to learn Python by rewriting Andrew Ng's Machine Learning course assignments from Octave (I took the class and got the certificate). I'm having issues with the optimization functions. In the course they use fmincg, an Octave function that minimizes the (convex) cost function of linear regression when given its derivative. They also teach gradient descent and the normal equation, and in theory all of them give the same result (within a few decimal places) if used correctly. They all work great for linear regression, and I do get the same results in Python. To be clear, I'm trying to minimize the cost function in order to find the best-fitting parameters (theta) for the data set.

So far I've used 'Nelder-Mead', which doesn't need the derivative, and it gave me the solution that looks closest to what they have. I've also tried 'TNC', 'CG' and 'BFGS', which all require a derivative to minimize the function. They all work great with a first-order (linear) polynomial, but when I increase the order of the polynomial to something non-linear, in my case x^1 up to x^8, I can't get my function to fit the data set. The exercise I'm doing is really simple: I have 12 data points, so fitting an 8th-order polynomial should capture every single point (if you're curious, it's an example of high variance, i.e. overfitting the data). The solution they show is a curve that goes through all the data points, as expected, and captures everything. The best I got was with the 'Nelder-Mead' method, and it only captured two points out of the data set, while the rest of the minimization methods didn't give me anything close to what I'm looking for. I'm not sure what's wrong, because my cost function and gradients give the right values for the linear case (the exact answer Octave gives), so I'm assuming they're working fine.

I'm going to list the functions in both Octave and Python in the hope that someone can explain why I'm getting different answers, or point out the obvious error that I'm not seeing.

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear 
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the 
%   cost of using theta as the parameter for linear regression to fit the 
%   data points in X and y. Returns the cost in J and the gradient in grad


m = length(y); % number of training examples 
J = 0;
grad = zeros(size(theta));

htheta = X * theta;
n = numel(theta); % number of parameters
J = 1 / (2 * m) * sum((htheta - y) .^ 2) + lambda / (2 * m) * sum(theta(2:n) .^ 2);

grad = 1 / m * X' * (htheta - y);
grad(2:n) = grad(2:n) + lambda / m * theta(2:n); % do not regularize the bias term
grad = grad(:);

end

Here are snippets of my code, and if anyone would like the full code, I can provide that as well:

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

def costFunction(theta, Xcost, y, lmda):
    # regularized linear regression cost; theta comes in flat from the optimizer
    m = len(y)
    theta = theta.reshape((len(theta), 1))
    htheta = np.dot(Xcost, theta) - y          # residuals h(x) - y
    # note: 1 / (2 * m) truncates to 0 under Python 2 with integer m
    J = 1 / (2 * m) * np.dot(htheta.T, htheta) + lmda / (2 * m) * np.sum(theta[1:, :] ** 2)
    return J
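
As a quick sanity check that the cost comes out right, here is a minimal hand-checkable example (my own sketch, not part of the assignment; it assumes Python 3 division semantics, or from __future__ import division under Python 2):

# one training example X = [1, 2], y = 3, theta = [0, 0], lambda = 0:
# J = 1/(2*1) * (0 - 3)^2 = 4.5
Xchk = np.array([[1., 2.]])
ychk = np.array([[3.]])
print(costFunction(np.zeros(2), Xchk, ychk, 0.))   # expect [[4.5]]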

def gradCostFunc(gradtheta, X, y, lmda):
    # gradient of the regularized cost; returns a flat array for the optimizer
    m = len(y)
    gradtheta = gradtheta.reshape((len(gradtheta), 1))
    hgradtheta = np.dot(X, gradtheta) - y      # residuals

    grad = (1 / m) * np.dot(X.T, hgradtheta)
    # regularize every parameter except the bias term
    grad[1:, 0] = grad[1:, 0] + (lmda / m) * gradtheta[1:, 0]
    return grad.reshape((len(grad)))
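
To check that the analytic gradient is consistent with the cost, a simple central finite-difference comparison can be run (my own sketch; numericalGrad is a helper I made up, and X_poly and y are the arrays built in the main call further down):

def numericalGrad(theta, X, y, lmda, eps=1e-4):
    # central finite differences of costFunction, one parameter at a time
    num = np.zeros(len(theta))
    for i in range(len(theta)):
        tplus, tminus = theta.copy(), theta.copy()
        tplus[i] += eps
        tminus[i] -= eps
        jp = np.asarray(costFunction(tplus, X, y, lmda)).item()
        jm = np.asarray(costFunction(tminus, X, y, lmda)).item()
        num[i] = (jp - jm) / (2 * eps)
    return num

t0 = np.ones(X_poly.shape[1])
diff = np.max(np.abs(numericalGrad(t0, X_poly, y, 1.) - gradCostFunc(t0, X_poly, y, 1.)))
print(diff)   # should be tiny (around 1e-9) if the gradient matches the cost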

def normalEqn(X, y, lmda):
    # closed-form solution: theta = pinv(X'X + lambda*E) * X'y,
    # with the bias term excluded from regularization
    e = np.eye(X.shape[1])
    e[0, 0] = 0
    theta = np.dot(np.linalg.pinv(np.dot(X.T, X) + lmda * e), np.dot(X.T, y))
    return theta

def gradientDescent(X, y, theta, alpha, lmda, num_iters):
    # minimize the cost iteratively with batch gradient descent;
    # theta should be a flat array so that theta - alpha * grad keeps its shape
    # J_history tracks the evolution of the cost function
    J_history = np.zeros((num_iters, 1))

    for i in range(num_iters):
        grad = gradCostFunc(theta, X, y, lmda)
        # update the parameters
        theta = theta - alpha * grad
        J_history[i] = costFunction(theta, X, y, lmda)

    # plot the cost history to check that it is decreasing
    plt.plot(J_history)
    plt.show()

    return theta

def trainLR(initheta, X, y, lmda):
    # minimize the cost with SciPy, passing the analytic gradient via jac
    options = {'maxiter': 1000}
    res = optimize.minimize(costFunction, initheta, jac=gradCostFunc, method='CG',
                            args=(X, y, lmda), options=options)
    #res = optimize.minimize(costFunction, initheta, method='nelder-mead',
    #                        args=(X, y, lmda), options={'disp': False})
    #res = optimize.fmin_bfgs(costFunction, initheta, fprime=gradCostFunc, args=(X, y, lmda))
    return res.x
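
Since the whole point is that different methods should land on the same minimum, it can help to run them side by side (my own sketch; costScalar is a helper I added because costFunction returns a 1x1 array rather than a plain scalar, and X_poly and y are the arrays from the main call below):

def costScalar(theta, X, y, lmda):
    # the optimizer's line searches want a plain float, not a 1x1 array
    return np.asarray(costFunction(theta, X, y, lmda)).item()

for method in ('CG', 'BFGS', 'TNC'):
    res = optimize.minimize(costScalar, np.zeros(X_poly.shape[1]),
                            jac=gradCostFunc, method=method,
                            args=(X_poly, y, 0.))
    print(method, res.success, res.fun)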

def polyFeatures(X, degree):
    # map X to the columns [X, X^2, ..., X^degree]
    out = X
    for i in range(2, degree + 1):
        out = np.column_stack((out, X ** i))
    return out
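
To show what the mapping produces, here is a tiny example (my own sketch):

# a single column [1, 2, 3] mapped up to degree 3
Xdemo = np.array([[1.], [2.], [3.]])
print(polyFeatures(Xdemo, 3))
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]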

def featureNormalize(X):
    # the polynomial features vary by orders of magnitude,
    # so normalize each column to zero mean and unit standard deviation
    mu = np.mean(X, axis=0)
    S1 = np.std(X, axis=0)
    return mu, S1, (X - mu) / S1
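
featureNormalize returns mu and S1 so the same scaling can be reapplied to data that comes later (a sketch; Xval, p, mu, sigma and the padding helper are the ones from the main call below):

# normalize new data with the *training* mu and sigma rather than its own
Xval_poly = polyFeatures(Xval, p)
Xval_poly = padding((Xval_poly - mu) / sigma)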

And here is the main call for these functions:

X, y, Xval, yval, Xtest, ytest = loadData('ex5data1.mat')
X_poly = X                     # to be used later on in the program
p = 8                          # polynomial degree
X_poly = polyFeatures(X_poly, p)
mu, sigma, X_poly = featureNormalize(X_poly)
X_poly = padding(X_poly)       # add the bias (intercept) column of ones
theta = np.zeros((X_poly.shape[1], 1))
theta = trainLR(theta, X_poly, y, 0.)
#theta = normalEqn(X_poly, y, 0.)
#theta = gradientDescent(X_poly, y, theta, 0.1, 0, 1500)
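
Since the optimizer, the normal equation and gradient descent should all agree to a few decimal places, comparing them directly is a quick way to see which one is off (my own sketch; the initial theta is passed flat so the gradient-descent update broadcasts correctly, and the learning rate / iteration count may need tuning):

theta_opt = trainLR(np.zeros(X_poly.shape[1]), X_poly, y, 0.)
theta_ne  = normalEqn(X_poly, y, 0.)
theta_gd  = gradientDescent(X_poly, y, np.zeros(X_poly.shape[1]), 0.1, 0., 5000)
# print the three solutions side by side; the columns should match closely
print(np.c_[np.asarray(theta_opt).ravel(),
            np.asarray(theta_ne).ravel(),
            np.asarray(theta_gd).ravel()])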
    Why not compare your results at each step to your Octave's correct result? You can print the intermediate results of your cost function and grad cost function. – lennon310 Dec 20 '13 at 23:00

1 Answer


My answer is probably off point, because your question was for help debugging your current implementation.

That said, if you're interested in using ready-made optimisers in Python then have a look at OpenOpt. The library contains reasonably performant implementations of optimisers for a wide variety of optimisation problems.

I should also mention that the scikit-learn library provides a nice Machine Learning toolset for Python.
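
For instance, a minimal scikit-learn pipeline for this kind of polynomial fit might look like the sketch below (my own illustration, not the question's code; it assumes the X and y arrays loaded in the question and uses LinearRegression since lambda is 0 there; swapping in sklearn.linear_model.Ridge gives the regularised version):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# degree-8 features, column standardisation, then an ordinary least-squares fit
model = make_pipeline(PolynomialFeatures(degree=8, include_bias=False),
                      StandardScaler(),
                      LinearRegression())
model.fit(X, y.ravel())
print(model.predict(X))   # should pass (nearly) through all 12 points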

Rob
  • If you can help find the bug, that's great, but all I'm asking here is why I'm getting different answers from the different functions. – Henry80s Dec 20 '13 at 21:36
  • @Henry80s: I think I misread your original question. Are you (a) getting different answers when using Octave or Python, or (b) getting unexpected results for higher-order polynomial fits? – Rob Dec 20 '13 at 21:43
  • All I'm asking here is why I'm getting different answers from the different functions. Also, do my costFunction and gradCostFunc look correct? – Henry80s Dec 20 '13 at 21:54
  • I'm only getting this when I go to higher-order polynomials, not with the linear case. – Henry80s Dec 20 '13 at 21:55
  • As you're not interested in regularising this yet, one thing to try would be to remove the regularisation terms temporarily. That will ensure that there is no penalty on overfitting, so there's a better chance of fitting through all the points. I'm also curious about the dimensions of your X matrix and theta vector. Do both have the polynomial order as one of their dimensions, and does one of the entries in X consist of [1 x x^2 ... x^8]? Thanks. – Rob Dec 20 '13 at 22:26
  • The issue has been resolved; I had made a mistake somewhere else in the code. – Henry80s Apr 22 '14 at 02:31