I’m trying to learn Python by rewriting Andrew Ng’s Machine Learning course assignments from Octave (I took the class and got the certificate). I’m having issues with the optimization functions. In the course they use fmincg, a function used in the Octave version of the course to minimize the (convex) cost function of linear regression given its derivative. They also teach gradient descent and the normal equation, and in theory all three give the same result (within a few decimal places) when used correctly. They all work great for linear regression, and I do get the same results in Python. To be clear, I’m trying to minimize the cost function to find the best-fitting parameters (theta) for the data set.

So far I’ve used 'nelder-mead', which doesn’t need the derivative, and it gave me the solution closest to what they show. I’ve also tried 'TNC', 'CG' and 'BFGS', which all require a derivative to minimize the function. They all work great with a first-order (linear) polynomial, but when I increase the order to something non-linear, in my case x^1 up to x^8, I can’t get my function to fit the data set. The exercise is really simple: I have 12 data points, so an 8th-order polynomial should capture every single point (if you’re curious, it’s an example of high variance, i.e. overfitting the data). The solution they show is a curve that goes through all the data points, as expected, and captures everything. The best I got was with 'nelder-mead', and it only captured two points out of the data set, while the rest of the minimization methods didn’t give me anything close to what I’m looking for.

I’m not sure what’s wrong, because my cost function and gradients give the right values for the linear case (the exact answers Octave gives), so I’m assuming they work fine.
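For reference, the regularized cost and its gradient (what fmincg, and later scipy's minimize, are given) are

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_0^{(i)}, \qquad \frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad (j \ge 1)$$

with $h_\theta(x) = \theta^T x$; the bias term $\theta_0$ is not regularized.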
I’m going to list the functions in both Octave and Python in the hope that someone can explain why I’m getting different answers, or point out the obvious error that I’m not seeing.
function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear
%regression with multiple variables
% [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the
% cost of using theta as the parameter for linear regression to fit the
% data points in X and y. Returns the cost in J and the gradient in grad
m = length(y); % number of training examples
J = 0;
grad = zeros(size(theta));
htheta = X * theta;
n = size(theta);
J = 1 / (2 * m) * sum((htheta - y) .^ 2) + lambda / (2 * m) * sum(theta(2:n) .^ 2);
grad = 1 / m * X' * (htheta - y);
grad(2:n) = grad(2:n) + lambda / m * theta(2:n); % don't regularize the bias term
grad = grad(:);
end
Here are snippets of my Python code; if anyone would like the full code, I can provide that as well:
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

def costFunction(theta, Xcost, y, lmda):
    # regularized linear regression cost; the bias term theta[0] is not regularized
    m = len(y)
    theta = theta.reshape((len(theta), 1))
    htheta = np.dot(Xcost, theta) - y  # note: htheta here is the residual X*theta - y
    J = 1 / (2 * m) * np.dot(htheta.T, htheta) + lmda / (2 * m) * np.sum(theta[1:, :]**2)
    return J
def gradCostFunc(gradtheta, X, y, lmda):
    # gradient of the regularized cost; again the bias term is left unregularized
    m = len(y)
    gradtheta = gradtheta.reshape((len(gradtheta), 1))
    hgradtheta = np.dot(X, gradtheta) - y
    #gradtheta[0,0] = 0.
    grad = (1 / m) * np.dot(X.T, hgradtheta)
    #for i in range(1,len(grad)):
    grad[1:, 0] = grad[1:, 0] + (lmda / m) * gradtheta[1:, 0]
    return grad.reshape((len(grad)))
def normalEqn(X, y, lmda):
    # closed-form solution: theta = pinv(X'X + lambda*E) X'y, with E[0,0] = 0 so the bias is not regularized
    e = np.eye(X.shape[1])
    e[0, 0] = 0
    theta = np.dot(np.linalg.pinv(np.dot(X.T, X) + lmda * e), np.dot(X.T, y))
    return theta
def gradientDescent(X, y, theta, alpha, lmda, num_iters):
    # calculate gradient descent in an iterative manner
    m = len(y)
    # J_history tracks the evolution of the cost function
    J_history = np.zeros((num_iters, 1))
    # calculating the gradients
    for i in range(0, num_iters):
        grad = gradCostFunc(theta, X, y, lmda)
        # updating the thetas
        theta = theta - alpha * grad
        J_history[i] = costFunction(theta, X, y, lmda)
    plt.plot(J_history)
    plt.show()
    return theta
def trainLR(initheta, X, y, lmda):
    options = {'maxiter': 1000}
    res = optimize.minimize(costFunction, initheta, jac=gradCostFunc, method='CG',
                            args=(X, y, lmda), options=options)
    #res = optimize.minimize(costFunction, initheta, method='nelder-mead', args=(X, y, lmda), options={'disp': False})
    #res = optimize.fmin_bfgs(costFunction, initheta, fprime=gradCostFunc, args=(X, y, lmda))
    return res.x
def polyFeatures(X, degree):
    # map X onto higher-order polynomial features: [X, X^2, ..., X^degree]
    out = X
    if degree >= 2:
        for i in range(2, degree + 1):
            out = np.column_stack((out, X**i))
    return out
def featureNormalize(X):
    # since the column values vary by orders of magnitude,
    # normalize each feature to zero mean and unit standard deviation
    mu = np.mean(X, axis=0)
    S1 = np.std(X, axis=0)
    return mu, S1, (X - mu) / S1
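Just to illustrate why I normalize: with made-up x values in roughly the data set's range, the x^8 column dwarfs the x column before normalization, and every column comes out with roughly zero mean and unit standard deviation afterwards (a minimal sketch, not from the assignment):

x = np.array([[-15.0], [-8.0], [1.0], [22.0], [36.0]])   # made-up values, illustration only
xp = polyFeatures(x, 8)
print(xp[:, 0].max(), xp[:, -1].max())   # the columns span many orders of magnitude
mu, sigma, xp_norm = featureNormalize(xp)
print(xp_norm.mean(axis=0))              # ~0 for every column
print(xp_norm.std(axis=0))               # ~1 for every column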
And here is the main call for these functions:
X, y, Xval, yval, Xtest, ytest = loadData('ex5data1.mat')

X_poly = X  # to be used later on in the program
p = 8
X_poly = polyFeatures(X_poly, p)              # add polynomial features up to x^8
mu, sigma, X_poly = featureNormalize(X_poly)  # normalize the columns
X_poly = padding(X_poly)                      # padding() adds the column of ones for the bias term

theta = np.zeros((X_poly.shape[1], 1))
theta = trainLR(theta, X_poly, y, 0.)
#theta = normalEqn(X_poly, y, 0.)
#theta = gradientDescent(X_poly, y, theta, 0.1, 0, 1500)
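If it helps with diagnosing this, here is the kind of check I can run right after the main call to compare the analytic gradient with a numerical one on the polynomial features (a minimal sketch using scipy.optimize.check_grad; the float() wrapper is only there to make sure the cost comes back as a plain scalar, and lambda is set to 1 so the regularization terms are exercised too):

from scipy.optimize import check_grad

# compare gradCostFunc against a finite-difference gradient of costFunction
# at a non-trivial theta; a small error (~1e-5 or less) means the two agree
theta_check = 0.1 * np.ones(X_poly.shape[1])
err = check_grad(lambda t: float(costFunction(t, X_poly, y, 1.0)),
                 lambda t: gradCostFunc(t, X_poly, y, 1.0),
                 theta_check)
print(err)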