
Currently, my convergence criterion for SGD checks whether the ratio of the current MSE to the previous MSE stays within a specific bound.

import numpy as np

def compute_mse(data, labels, weights):
    m = len(labels)
    hypothesis = np.dot(data, weights)
    sq_errors = (hypothesis - labels) ** 2
    mse = np.sum(sq_errors) / (2.0 * m)
    return mse

cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
    prev_mse = cur_mse

    for i in range(m):
        d = np.array(data[i])
        hypothesis = np.dot(d, weights)
        gradient = np.dot((labels[i] - hypothesis), d)/m
        weights = weights + (alpha * gradient)

    cur_mse = compute_mse(data, labels, weights)
    if cur_mse > prev_mse:
        return 

The weights are updated w.r.t. a single data point in the training set.

With an alpha of 0.001, the model is supposed to converge within a few iterations, but I get no convergence. Is this convergence criterion too strict?

indecisivecoder

1 Answer


I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:

input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)

while True:
    shuffle data # good practice, not strictly needed
    for d in data:
        x -= alpha * grad(f(x), d) # gradient of f at x, estimated from the single sample d
    if <stopping criterion>:
        break

Other regularization terms can be added to the function you want to minimize, such as an L1 penalty to avoid overfitting.
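
For example, here is a minimal sketch of what a single SGD step with an added L1 penalty could look like (my own illustration, not part of the original code; the function name sgd_step_l1 and the penalty weight lam are hypothetical):

import numpy as np

# hypothetical SGD step with an added L1 penalty: squared error + lam * sum(|weights|)
def sgd_step_l1(weights, data_row, label, alpha, lam):
    error = np.dot(data_row, weights) - label   # prediction error for a single sample
    grad = 2 * np.dot(data_row.T, error)        # gradient of the squared-error term
    grad += lam * np.sign(weights)              # subgradient of the L1 penalty
    return weights - alpha * grad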

Going back to your problem: looking at your data and your definition of the gradient, it looks like you want to solve a simple linear system of equations of the form:

Ax = b

which yields the objective function:

f(x) = ||Ax - b||^2

Stochastic gradient descent uses one row of data at a time:

||A_i x - b_i||^2

where || o || is the Euclidean norm and the subscript _i denotes the i-th row.

Here, A is your data, x is your weights and b is your labels.

The gradient of the function is then computed as:

grad(f(x)) = 2 * A.T (Ax - b)

Or, in the case of stochastic gradient descent:

2 * A_i.T (A_i x - b_i)

where .T means transpose.
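
To connect these formulas to numpy, here is a small, self-contained sketch of my own (the array sizes are just for illustration):

import numpy as np

A = np.random.randn(5, 2)   # 5 samples, 2 features
x = np.random.randn(2, 1)   # weights
b = np.random.randn(5, 1)   # targets
i = 0                       # index of one row

# full-batch gradient: 2 * A.T (Ax - b)
full_grad = 2 * np.dot(A.T, np.dot(A, x) - b)        # shape (2, 1)

# stochastic gradient using only row i: 2 * A_i.T (A_i x - b_i)
A_i, b_i = A[i:i+1], b[i:i+1]                        # slicing keeps the row dimension
row_grad = 2 * np.dot(A_i.T, np.dot(A_i, x) - b_i)   # shape (2, 1)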

Putting everything back into your code... first I will set up some synthetic data:

import numpy as np
from random import shuffle

A = np.random.randn(100, 2)  # 100x2 data
x = np.random.randn(2, 1)  # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1)  # 100x1 labels
b[b == 0] = -1  # labels in {-1, 1}

Then, define the parameters:

alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = list(range(m))  # a list, so that shuffle can reorder it in place

And loop!

while cur_mse/prev_mse < 0.99999 and it < max_iter:
    prev_mse = cur_mse
    shuffle(idx)  # visit the rows in a random order each epoch

    for i in idx:
        d = A[i:i+1]  # i-th row of the data, shape (1, 2)
        y = b[i:i+1]  # i-th label, shape (1, 1)
        h = np.dot(d, x)  # prediction for that row
        dx = 2 * np.dot(d.T, (h - y))  # per-row gradient: 2 * A_i.T (A_i x - b_i)
        x -= (alpha * dx)

    cur_mse = np.mean((A.dot(x) - b)**2)
    if cur_mse > prev_mse:
        raise Exception("Not converging")
    it += 1

This code is pretty much the same as yours, with a couple of additions:

  • Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge, or converges too slowly)

  • Redefinition of the gradient dx (still similar to yours). Your formulation has the sign inverted, which is why your weight update uses +, while in my example it uses - (which makes sense, since you are going down the gradient).

  • Indexing of data and labels. While data[i] gives a 1D array of shape (2,) (in this case, for 100x2 data), slicing with data[i:i+1] returns a view that keeps the row dimension (shape (1, 2)) and therefore lets you perform the proper matrix multiplications (see the short check below).
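
To see that indexing difference concretely, here is a tiny check (my own illustration, separate from the loop above):

import numpy as np

data = np.random.randn(100, 2)
print(data[0].shape)    # (2,)   -> 1D array, the row dimension is lost
print(data[0:1].shape)  # (1, 2) -> 2D row vector, keeps matrix semantics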

You can add a third stopping criterion based on an acceptable MSE, e.g.: if cur_mse < 1e-3: break.

This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).

So... assuming this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdetermined (you have fewer training samples than features, which means A is wider than it is tall).
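
If you want to check that directly, one option (my suggestion, not something the code above requires) is to solve the least-squares problem with np.linalg.lstsq and inspect the rank and residual; if even the direct solution fits poorly, SGD will not do better on the same objective:

import numpy as np

A = np.random.randn(100, 2)  # substitute your own data and labels here
b = np.random.randn(100, 1)
x_ls, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(rank, residuals)  # rank < number of features hints at an underdetermined system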

Hope it helps!

Imanol Luengo
  • @illuengo, thanks, that's helpful, but would the indexing of data and labels affect convergence in any way? Also, my data is definitely not wider than it is tall; it's made up of 4601 points, each with 57 features. – indecisivecoder Mar 25 '15 at 15:11
  • Well, it does affect it, since with my data your algorithm doesn't work and gives the error `ValueError: shapes (1,) and (2,) not aligned: 1 (dim 0) != 2 (dim 0)` when trying to compute the gradient. That is because my `data`, `labels` and `weights` are all 2D matrices. If you make `weights` and `labels` 1D then your code works fine (but I find it good practice to stick to matrix notation when computing algebraic operations). – Imanol Luengo Mar 25 '15 at 17:26
  • Anyway, that is not what is wrong with your code; the problems you have are: 1) `cur_mse` should be set to something high (say 100 or 100000), since it is going to be assigned to `prev_mse` in the first instruction. 2) `prev_mse` should be set to something even higher (like `np.inf`). And 3), most important, your gradient computation is not correct: remove the `/ m` and add `2 * ` (for faster convergence). – Imanol Luengo Mar 25 '15 at 17:28
  • One last note, `shuffle` is imported from `random` as `from random import shuffle`, in case you want to add that to your code. – Imanol Luengo Mar 25 '15 at 18:26
  • Oops! Forgot to mention you, @IndecisiveCoder – Imanol Luengo Mar 26 '15 at 11:47
  • Thanks for the help, but the problem was in the hypothesis function. The first entry in each data point needs to be 1 to account for the constant term and correctly compute the hypothesis. – indecisivecoder Apr 04 '15 at 17:42