I'm trying to train a classifier on the MNIST dataset (handwritten digit images), and I want to implement a stochastic gradient descent algorithm. Here is the function I wrote:
import numpy
from random import randint
# tls is my own helper module providing h (sigmoid hypothesis) and cost

def Stochastic_gradient_descent(theta, y, X, alpha, nIter):
    costs = numpy.zeros([nIter, 1])
    N = y.size
    for i in range(nIter):
        idx = randint(0, N - 1)  # pick one training example at random
        theta -= alpha * (tls.h(theta, X)[idx] - y[idx]) * X[[idx], :].T
        costs[i] = (1.0 / N) * tls.cost(theta, y, X)  # float division to avoid integer division under Python 2
    return theta, costs
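For reference, I call it roughly like this (the learning rate and the initial theta here are just placeholders for illustration):

    # shapes assumed: X is 50000*785, y is 50000*1
    theta_init = numpy.zeros([785, 1])
    theta, costs = Stochastic_gradient_descent(theta_init, y, X, alpha=0.01, nIter=100)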
alpha is the step size (learning rate),
h is the sigmoid function applied to transpose(theta).X,
X is a 50000*785 matrix, where 50000 is the size of the training set and 785 = (number of pixels per image) + 1 (for the constant term theta0).
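For context, h and cost in tls are essentially the standard sigmoid hypothesis and logistic cost; a simplified sketch (not my exact code) looks like this:

    import numpy

    def h(theta, X):
        # sigmoid of X.theta -> column vector of predicted probabilities
        return 1.0 / (1.0 + numpy.exp(-numpy.dot(X, theta)))

    def cost(theta, y, X):
        # cross-entropy cost summed over the whole training set
        p = h(theta, X)
        return -numpy.sum(y * numpy.log(p) + (1 - y) * numpy.log(1 - p))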
This function runs in roughly 9 seconds for 100 iterations (nIter), i.e. about 100*1*785 multiplications for the parameter updates. The classifiers I obtain are satisfactory. I wanted to compare this running time with that of a (batch) gradient descent algorithm, where the update is:
theta -= alpha * (1.0 / N) * numpy.dot((tls.h(theta, X) - y).T, X).T
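Wrapped in a loop, the batch version I am timing looks roughly like this (a sketch around the update above, with the same assumptions about tls):

    def Gradient_descent(theta, y, X, alpha, nIter):
        costs = numpy.zeros([nIter, 1])
        N = y.size
        for i in range(nIter):
            # full-batch update: the gradient uses all 50000 examples at once
            theta -= alpha * (1.0 / N) * numpy.dot((tls.h(theta, X) - y).T, X).T
            costs[i] = (1.0 / N) * tls.cost(theta, y, X)
        return theta, costs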
This function runs in roughly 12 seconds for 100 iterations (nIter), i.e. about 100*50000*785 multiplications, since (h(theta, X) - y) is a 50000*1 vector. The classifiers I obtain are also satisfactory, but I am surprised that this code is not much slower than the first one. I understand that vectorization plays an important role in the dot function, but I would still have expected worse performance. Is there a way to improve the performance of my stochastic gradient descent?
Thank you for your help.