
I'm trying to implement the Logistic Regression exercise from Coursera in Julia, but it doesn't work.

The Julia code to calculate the gradient:

sigmoid(z) = 1 / (1 + e ^ -z)

hypotesis(theta, x) = sigmoid(scalar(theta' * x))

function gradient(theta, x, y)
    (m, n) = size(x)
    h = [hypotesis(theta, x[i,:]') for i in 1:m]
    g = Array(Float64, n, 1)
    for j in 1:n
        g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])
    end
    g
end

If this gradient is used, it produces the wrong results. I can't figure out why; the code looks right to me.

Here is the full Julia script. In this script the optimal theta is calculated both with my gradient descent implementation and with the Optim package, and the results are different.

Alexey Petrushin

2 Answers


The gradient is correct (up to a scalar multiple, as @roygvib points out). The problem is with the gradient descent.

If you look at the values of the cost function during your gradient descent, you will see a lot of NaN, which probably come from the exponential: lowering the step size (e.g., to 1e-5) will avoid the overflow, but you will have to increase the number of iterations a lot (perhaps to 10_000_000).
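
For instance (assuming cost_j is the standard cross-entropy cost of the exercise), a single saturated sigmoid already produces a NaN term:

h = 1 / (1 + exp(-50.0))   # sigmoid(50): exp(-50) is below machine epsilon, so h == 1.0 exactly
log(1 - h)                 # log(0.0) == -Inf
0.0 * log(1 - h)           # NaN -- this is the cost term (1 - y[i]) * log(1 - h[i]) when y[i] == 1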

A better (faster) solution would be to let the step size vary. For instance, one could multiply the step size by 1.1 if the cost function improves after a step (the optimum still looks far away in this direction: we can go faster), and divide it by 2 if it does not (we went too fast and ended up past the minimum).

One could also do a line search in the direction of the gradient to find the best step size (but this is time-consuming and can be replaced by approximations, e.g., Armijo's rule).
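
A rough sketch of such a backtracking (Armijo) line search, reusing cost_j and gradient from the question's script (the constants c = 1e-4 and rho = 0.5 are conventional defaults, nothing problem-specific):

# One gradient-descent step with a backtracking (Armijo) line search.
function backtracking_step(theta, x, y; alpha = 1.0, c = 1e-4, rho = 0.5)
    g  = gradient(theta, x, y)
    f0 = cost_j(theta, x, y)
    # Shrink alpha until the cost decreases by at least the fraction c of what a
    # linear model predicts; the negated test also rejects NaN (overflowed) trial costs.
    while !( cost_j(theta - alpha * g, x, y) <= f0 - c * alpha * sum(g .^ 2) )
        alpha *= rho
    end
    return theta - alpha * g
end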

Rescaling the predictive variables also helps.
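
For instance, one could standardize every predictor column to zero mean and unit standard deviation, leaving the intercept column alone (this assumes, as in the Coursera exercise, that the first column of x is a column of ones):

# Standardize the predictor columns of x; column 1 (the intercept) is left untouched.
# (On Julia >= 1.0, mean and std require `using Statistics`; in older Julia they are in Base.)
function standardize(x)
    xs = float(copy(x))
    for j in 2:size(x, 2)
        xs[:, j] = (x[:, j] .- mean(x[:, j])) ./ std(x[:, j])
    end
    return xs
end

With comparable feature scales, a single step size works reasonably well in every direction, so gradient descent needs far fewer iterations.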

Vincent Zoonekynd

I tried comparing gradient() in the OP's code with a numerical derivative of cost_j() (the objective function being minimized), using the following routine:

# Numerical gradient of cost_j via central (two-sided) finite differences.
function grad_num( theta, x, y )
    g = zeros( 3 )

    eps = 1.0e-6
    disp = zeros( 3 )

    for k = 1:3
        disp[:] = theta[:]

        # perturb the k-th component by +/- eps and take the centered difference
        disp[ k ] = theta[ k ] + eps
        plus = cost_j( disp, x, y )
        disp[ k ] = theta[ k ] - eps
        minus = cost_j( disp, x, y )

        g[ k ] = ( plus - minus ) / ( 2.0 * eps )
    end
    return g
end
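
For example, the two can be compared at an arbitrary test point (x and y are the data from the OP's script; theta0 is just a value picked for the check):

theta0 = [0.1 -0.2 0.3]'            # arbitrary test point, same shape as the OP's theta
println( gradient(theta0, x, y) )   # analytic gradient from the OP's code
println( grad_num(theta0, x, y) )   # central-difference approximation of the cost_j gradient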

But the gradient values obtained from the two routines do not seem to agree very well (at least in the initial stage of minimization)... So I manually derived the gradient of cost_j( theta, x, y ), from which it seems that a division by m is missing:

#/ OP's code
# g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] )

#/ modified code
  g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] ) / m
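
For reference, assuming cost_j is the usual cross-entropy cost of the exercise,

  cost_j(theta, x, y) = -(1/m) * sum_{i=1..m} [ y[i]*log(h[i]) + (1 - y[i])*log(1 - h[i]) ]

its partial derivative with respect to theta[j] works out to

  d cost_j / d theta[j] = (1/m) * sum_{i=1..m} (h[i] - y[i]) * x[i, j]

that is, the OP's g[j] divided by m.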

Because I am not entirely sure the above code and expression are correct, could you check them yourself...?

But in fact, regardless of whether I use the original or the corrected gradient, the program converges to the same minimum value (0.2034977016, almost the same as obtained from Optim), because the two gradients differ only by a multiplicative factor! Because the convergence was very slow, I also modified the step size alpha adaptively, following the suggestion by Vincent (here I used more moderate values for acceleration/deceleration):

function gradient_descent(x, y, theta, alpha, n_iterations)
    ... 
    c = cost_j( theta, x, y )

    for i = 1:n_iterations
        c_prev = c
        c = cost_j( theta, x, y )

        # grow alpha slightly while the cost keeps decreasing, shrink it otherwise
        if c - c_prev < 0.0
            alpha *= 1.01
        else
            alpha /= 1.05
        end
        theta[:] = theta - alpha * gradient(theta, x, y)
    end
    ...
end

and called this routine as

optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ]

The variation of cost_j versus the iteration step is plotted below:

[plot: cost_j vs. iteration step]

roygvib
  • Thanks, yes, you are correct, the gradient should be divided by `m`. The strange thing, though, is that in the original Octave example the gradient descent is expected to converge in about 400 iterations. I will double-check the original code; maybe they use some tricks to speed it up. – Alexey Petrushin May 12 '16 at 07:45