I'm trying to train a classifier on the MNIST dataset (handwritten digit images), and I want to implement a stochastic gradient descent algorithm. Here is the function I wrote:
import numpy
from random import randint
# tls is my own helper module providing h (sigmoid hypothesis) and cost

def Stochastic_gradient_descent(theta, y, X, alpha, nIter):
    costs = numpy.zeros([nIter, 1])
    N = y.size
    for i in range(nIter):
        idx = randint(0, N - 1)  # pick one training example at random
        theta -= alpha * (tls.h(theta, X)[idx] - y[idx]) * X[[idx], :].T
        costs[i] = (1.0 / N) * tls.cost(theta, y, X)  # float division to avoid integer division under Python 2
    return theta, costs
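For reference, I call it roughly like this (the learning rate and the initial theta here are just placeholders for illustration):

    # shapes assumed: X is 50000*785, y is 50000*1
    theta_init = numpy.zeros([785, 1])
    theta, costs = Stochastic_gradient_descent(theta_init, y, X, alpha=0.01, nIter=100)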
alpha is the step size (learning rate),
h is the sigmoid function applied to transpose(theta).X,
X is a 50000*785 matrix, where 50000 is the size of the training set and 785 = (number of pixels per image) + 1 (for the constant term theta0).
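For context, h and cost in tls are essentially the standard sigmoid hypothesis and logistic cost; a simplified sketch (not my exact code) looks like this:

    import numpy

    def h(theta, X):
        # sigmoid of X.theta -> column vector of predicted probabilities
        return 1.0 / (1.0 + numpy.exp(-numpy.dot(X, theta)))

    def cost(theta, y, X):
        # cross-entropy cost summed over the whole training set
        p = h(theta, X)
        return -numpy.sum(y * numpy.log(p) + (1 - y) * numpy.log(1 - p))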
This function runs in roughly 9 seconds for 100 iterations (nIter), i.e. about 100*1*785 multiplications for the parameter updates. The classifiers I obtain are satisfactory. I wanted to compare this running time with that of a (batch) gradient descent algorithm, where the update is:
theta -= alpha * (1.0 / N) * numpy.dot((tls.h(theta, X) - y).T, X).T
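Wrapped in a loop, the batch version I am timing looks roughly like this (a sketch around the update above, with the same assumptions about tls):

    def Gradient_descent(theta, y, X, alpha, nIter):
        costs = numpy.zeros([nIter, 1])
        N = y.size
        for i in range(nIter):
            # full-batch update: the gradient uses all 50000 examples at once
            theta -= alpha * (1.0 / N) * numpy.dot((tls.h(theta, X) - y).T, X).T
            costs[i] = (1.0 / N) * tls.cost(theta, y, X)
        return theta, costs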
This function runs in roughly 12 seconds for 100 iterations (nIter), i.e. about 100*50000*785 multiplications, since (h(theta, X) - y) is a 50000*1 vector. The classifiers I obtain are also satisfactory, but I am surprised that this code is not much slower than the first one. I understand that vectorization plays an important role in the dot function, but I would still have expected worse performance. Is there a way to improve the performance of my stochastic gradient descent?
Thank you for your help.