
I'm trying to implement a simple regularized logistic regression algorithm in Julia. I'd like to use the Optim.jl library to minimize my cost function, but I can't get it to work.

My cost function and gradient are as follows:

function cost(X, y, theta, lambda)
    m = length(y)
    h = sigmoid(X * theta)
    reg = (lambda / (2*m)) * sum(theta[2:end].^2)
    J = (1/m) * sum( (-y).*log(h) - (1-y).*log(1-h) ) + reg
    return J
end

function grad(X, y, theta, lambda, gradient)
    m = length(y)
    h = sigmoid(X * theta)
    # gradient = zeros(size(theta))
    gradient = (1/m) * X' * (h - y)
    gradient[2:end] = gradient[2:end] + (lambda/m) * theta[2:end]
    return gradient
end

(Where theta is a vector of parameters for the hypothesis function and lambda is a regularization parameter.)

Then, following the instructions at https://github.com/JuliaOpt/Optim.jl, I try to call the optimization function like this:

# those are handle functions I define to pass them as arguments:
c(theta::Vector) = cost(X, y, theta, lambda)
g!(theta::Vector, gradient::Vector) = grad(X, y, theta, lambda, gradient)

# then I do
optimize(c,some_initial_theta) 
# or maybe
optimize(c,g!,initial_theta,method = :l_bfgs) # try a different algorithm

In both cases it says that it fails to converge, and the output looks kind of awkward:

julia> optimize(c,initial_theta)
Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.0,0.0,0.0,0.0,0.0]
 * Minimum: [1.7787162051775145,3.4584135105727145,-6.659680628594007,4.776952006060713,1.5034743945407143]
 * Value of Function at Minimum: -Inf
 * Iterations: 1000
 * Convergence: false
   * |x - x'| < NaN: false
   * |f(x) - f(x')| / |f(x)| < 1.0e-08: false
   * |g(x)| < NaN: false
   * Exceeded Maximum Number of Iterations: true
 * Objective Function Calls: 1013
 * Gradient Call: 0

julia> optimize(c,g!,initial_theta,method = :l_bfgs)
Results of Optimization Algorithm
 * Algorithm: L-BFGS
 * Starting Point: [0.0,0.0,0.0,0.0,0.0]
 * Minimum: [-6.7055e-320,-2.235e-320,-6.7055e-320,-2.244e-320,-6.339759952602652e-7]
 * Value of Function at Minimum: 0.693148
 * Iterations: 1
 * Convergence: false
   * |x - x'| < 1.0e-32: false
   * |f(x) - f(x')| / |f(x)| < 1.0e-08: false
   * |g(x)| < 1.0e-08: false
   * Exceeded Maximum Number of Iterations: false
 * Objective Function Calls: 75
 * Gradient Call: 75

Question

Is my method (from my first code listing) incorrect? Or am I misusing Optim.jl functions? Either way, what is the proper way to define and minimize the cost function here?

It's my first time with Julia, and I'm probably doing something terribly wrong, but I can't tell what exactly. Any help will be appreciated!

EDIT

X and y are the training set: X is a 90x5 matrix and y a 90x1 vector (my training set is taken from Iris - I don't think it matters).

machaerus
  • Programming questions are off-topic here. You will probably have better luck asking this question at stack-overflow. – Marc Claesen Sep 21 '15 at 16:47
  • Thanks for reply, @MarcClaesen, I asked it here since I'm not sure if my method is correct, not only implementation. Anyway, I'll try at SO. – machaerus Sep 21 '15 at 17:02
  • If there is a method problem hidden in your question, you might want to edit it to bring this to the forefront. Then the programming question is less prominent and it might be considered on-topic here. But if you want to try SO, flag the post for moving instead of re-posting. Thanks! – Momo Sep 21 '15 at 17:05
  • @Momo is right. Decide which aspect you think is most important & post your Q at the appropriate site. Be aware that few here will use Julia, so even if this is a method Q, if it is buried in the code & people can't read it, they won't be able to answer your Q even if they otherwise could. So you would need to clarify those issues. – gung - Reinstate Monica Sep 21 '15 at 17:14
  • I clarified my question as much as I could, but also flagged it for moving to SO. Thanks for the advice. – machaerus Sep 21 '15 at 17:23
  • I don't see any option controlling output level, except maybe show_trace, which might show useful info and help you see what's going on. Maybe trying different starting values will help. Default tolerances are crazy (not a good sign). But here's the key: the package you are using is not exactly the top of the line. Per https://github.com/JuliaOpt/Optim.jl , "Because it is being developed from scratch, it is not as robust as the C-based NLOpt package. For work whose accuracy must be unquestionable, we recommend using the NLOpt package." Use something better, whether in Julia or elsewhere. – Mark L. Stone Sep 21 '15 at 17:54
  • Do you need constraints on any variables? Not saying you do, just asking. The starting value of all zeros might be problematic, especially if the algorithm's not so robust. Nelder-Mead is generally not a good choice, in fact it's a terrible choice. O.k., in my post above when I said the package is "not exactly the top of the line", I was being extremely charitable and diplomatic. – Mark L. Stone Sep 21 '15 at 18:04
  • It was only using Nelder-Mead because in the first case there was no derivative provided. In the second case, where a derivative is provided, it uses L-BFGS. – IainDunning Sep 21 '15 at 20:28
  • Can you provide your `X` and `y`? – IainDunning Sep 21 '15 at 20:30
  • @IainDunning I guess there's no finite-difference BFGS version? Wouldn't that be better than Nelder-Mead if the gradient is not available but the objective function is smooth? Do BFGS or L-BFGS use trust regions, line search, or neither? – Mark L. Stone Sep 22 '15 at 01:36

2 Answers


Here is an example of unregularized logistic regression that uses the autodifferentiation functionality of Optim.jl. It might help you with your own implementation.

using Optim

const X = rand(100, 3)
const true_β = [5,2,4]
const true_y =  1 ./ (1 + exp(-X*true_β))

function objective(β)
    y = 1 ./ (1 + exp(-X*β))
    return sum((y - true_y).^2)  # Use SSE, non-standard for log. reg.
end

println(optimize(objective, [3.0,3.0,3.0],
                autodiff=true, method=LBFGS()))

Which gives me

Results of Optimization Algorithm
 * Algorithm: L-BFGS
 * Starting Point: [3.0,3.0,3.0]
 * Minimizer: [4.999999945789497,1.9999999853962256,4.0000000047769495]
 * Minimum: 0.000000
 * Iterations: 14
 * Convergence: true
   * |x - x'| < 1.0e-32: false
   * |f(x) - f(x')| / |f(x)| < 1.0e-08: false
   * |g(x)| < 1.0e-08: true
   * Exceeded Maximum Number of Iterations: false
 * Objective Function Calls: 53
 * Gradient Call: 53
IainDunning
  • Are there any downsides to using this approach? Isn't autodifferentiation slower than when the gradient is explicitly provided? – machaerus Sep 21 '15 at 22:35
  • It is slower (normally), but it could be useful for helping to check your implementation (which I suspect might be the reason you are having trouble). – IainDunning Sep 22 '15 at 02:36
  • Could you update this script? It doesn't work on Julia v1.2. – Dongdong Kong Sep 30 '19 at 19:06
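For later readers on Julia 1.x: the example above predates the explicit-broadcasting and keyword-argument changes, so it needs elementwise dots and the newer `autodiff = :forward` form. A sketch along these lines should work on recent Optim.jl versions (not tested against every release):

```julia
using Optim

const X = rand(100, 3)
const true_β = [5.0, 2.0, 4.0]
const true_y = 1 ./ (1 .+ exp.(-X * true_β))   # dots make the ops elementwise

function objective(β)
    y = 1 ./ (1 .+ exp.(-X * β))
    return sum((y .- true_y) .^ 2)  # SSE, non-standard for log. reg.
end

# autodiff is now a keyword argument; the method is a positional argument
println(optimize(objective, [3.0, 3.0, 3.0], LBFGS(); autodiff = :forward))
```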

Below you'll find my cost and gradient computation functions for logistic regression, using closures and currying (a version for those who are used to a function that returns both the cost and the gradient):

function cost_gradient(θ, X, y, λ)
    m = length(y)
    return (θ::Array) -> begin 
        h = sigmoid(X * θ)   
        J = (1 / m) * sum(-y .* log(h) .- (1 - y) .* log(1 - h)) + λ / (2 * m) * sum(θ[2:end] .^ 2)         
    end, (θ::Array, storage::Array) -> begin  
        h = sigmoid(X * θ) 
        storage[:] = (1 / m) * (X' * (h .- y)) + (λ / m) * [0; θ[2:end]]        
    end
end

Sigmoid function implementation:

sigmoid(z) = 1.0 ./ (1.0 + exp(-z))
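(On Julia 1.x, where elementwise operations require explicit dots, the same one-liner would presumably be written as:)

```julia
# Julia 1.x version: broadcasting must be explicit with dots
sigmoid(z) = 1.0 ./ (1.0 .+ exp.(-z))
```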

To apply cost_gradient in Optim.jl do the following:

using Optim
#...
# Prerequisites:
# X size is (m,d), where d is the number of training set features
# y size is (m,1)
# λ is the regularization parameter, e.g. 1.5
# ITERATIONS number of iterations, e.g. 1000
X=[ones(size(X,1)) X] #add x_0=1.0 column; now X size is (m,d+1)
initialθ = zeros(size(X,2),1) #initialTheta size is (d+1, 1)
cost, gradient! = cost_gradient(initialθ, X, y, λ)
res = optimize(cost, gradient!, initialθ, method = ConjugateGradient(), iterations = ITERATIONS);
θ = Optim.minimizer(res);

Now, you can easily predict (e.g. training set validation):

predictions = sigmoid(X * θ) #X size is (m,d+1)
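To turn those probabilities into hard class labels, you can threshold at 0.5 (the cutoff is a modeling choice, not something Optim returns). A small sketch with hypothetical values standing in for the real `predictions` and `y`:

```julia
# hypothetical example values; in the answer, predictions = sigmoid(X * θ)
predictions = [0.9, 0.3, 0.7, 0.1]
y = [1, 0, 1, 1]

labels = predictions .>= 0.5                       # Bool vector of predicted classes
accuracy = sum(labels .== (y .== 1)) / length(y)   # fraction predicted correctly
```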

Either try my approach or compare it with your implementation.

pkofod
Maciek Leks