18

I am implementing logistic regression using batch gradient descent. The input samples are to be classified into two classes, 1 and 0. During training I am using the following sigmoid function:

t = 1 ./ (1 + exp(-z));

where

z = x*theta

I am using the following cost function to calculate the cost and determine when to stop training.

function cost = computeCost(x, y, theta)
    htheta = sigmoid(x*theta);
    cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
end

The cost at each step comes out as NaN because the values of htheta are either 1 or 0 in most cases. What should I do to get a valid cost value at each iteration?

This is the gradient descent code for logistic regression:

function [theta, cost_history] = batchGD(x, y, theta, alpha)

cost_history = zeros(1000,1);

for iter = 1:1000
  htheta = sigmoid(x*theta);                 % current predictions for all samples
  new_theta = zeros(size(theta,1),1);
  for feature = 1:size(theta,1)
    % simultaneous update of every parameter
    new_theta(feature) = theta(feature) - alpha * sum((htheta - y) .* x(:,feature));
  end
  theta = new_theta;
  cost_history(iter) = computeCost(x, y, theta);
end
end
Neel Shah
  • What language are you using for coding that? Could you provide a minimal reproducible example along with data? – Arton Dorneles Feb 15 '16 at 22:06
  • The data consists of 57 features and has a label either 1 or 0, which is the y vector – Neel Shah Feb 15 '16 at 22:07
  • Any more details which I can provide you? – Neel Shah Feb 15 '16 at 22:46
  • It would be nice if you could provide a link with your data file. Do you verify the NaN values through the `cost_history` variable? Note that this variable has size 1000, but you are running 5000000 iterations. So `cost_history(iter) = computeCost(x,y,theta);` may be writing values that are out of range. – Arton Dorneles Feb 15 '16 at 23:33
  • This is highly dependent on your input data which you have neglected to include. What does your data matrix `x` look like? – rayryeng Feb 16 '16 at 00:03
  • With iterations = 1000, the values are coming out NaN. – Neel Shah Feb 16 '16 at 00:39
  • https://archive.ics.uci.edu/ml/datasets/Spambase This is the link to the dataset – Neel Shah Feb 16 '16 at 02:22

3 Answers

29

There are two possible reasons why this may be happening to you.

The data is not normalized

This is because when you apply the sigmoid / logistic function to your hypothesis, the output probabilities are almost all approximately 0 or 1, and with your cost function, log(1 - 1) or log(0) will produce -Inf. The accumulation of all of these individual terms in your cost function will eventually lead to NaN.

Specifically, if y = 0 for a training example and the output of your hypothesis htheta is a very small number x that underflows to exactly 0, the first part of the cost function gives us 0*log(x), that is 0*(-Inf), which produces NaN. Similarly, if y = 1 for a training example and htheta is so close to 1 that 1 - htheta underflows to 0, the second part again gives us 0*log(0) and produces NaN. Simply put, the output of your hypothesis is either very close to 0 or very close to 1.
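
You can see both behaviours at the MATLAB / Octave prompt (just a minimal sanity check, nothing specific to your data):

log(0)               % -Inf: a diverging but still well-defined cost term
0 * log(0)           % NaN: 0 * (-Inf) is undefined and poisons the whole sum
1 / (1 + exp(-50))   % evaluates to exactly 1 in double precision, so log(1 - ans) is -Inf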

This is most likely because the dynamic range of each feature is widely different, so the weighted sum x*theta for each training example will give you either very large negative or positive values, and if you apply the sigmoid function to these values, you'll get results very close to 0 or 1.

One way to combat this is to normalize the data in your matrix before performing training using gradient descent. A typical approach is to normalize with zero mean and unit variance. Given an input feature x_k, where k = 1, 2, ..., n and n is the number of features, the new normalized feature x_k^{new} can be found by:

x_k^{new} = (x_k - m_k) / s_k

where m_k is the mean of feature k and s_k is the standard deviation of feature k. This is also known as standardizing the data. You can read up on more details about this in another answer I gave here: How does this code for standardizing data work?

Because you are using the linear algebra approach to gradient descent, I'm assuming you have prepended your data matrix with a column of all ones. Knowing this, we can normalize your data like so:

mX = mean(x,1);       % per-feature means
mX(1) = 0;            % don't shift the intercept column of ones
sX = std(x,[],1);     % per-feature standard deviations
sX(1) = 1;            % don't scale the intercept column
xnew = bsxfun(@rdivide, bsxfun(@minus, x, mX), sX);   % (x - mean) ./ std, column by column

The mean and standard deviation of each feature are stored in mX and sX respectively. You can learn how this code works by reading the post I linked to above; I won't repeat that material here because it isn't in the scope of this post. To ensure proper normalization, I've made the mean and standard deviation of the first column 0 and 1 respectively, so the column of ones is left untouched. xnew contains the new normalized data matrix. Use xnew with your gradient descent algorithm instead.

Once you find the parameters, to perform any predictions you must normalize any new test instances with the mean and standard deviation from the training set. Because the parameters are learned with respect to the statistics of the training set, you must apply the same transformations to any test data you want to submit to the prediction model.

Assuming you have new data points stored in a matrix called xx, you would normalize and then perform the predictions:

xxnew = bsxfun(@rdivide, bsxfun(@minus, xx, mX), sX);

Now that you have this, you can perform your predictions:

pred = sigmoid(xxnew*theta) >= 0.5;

You can change the threshold of 0.5 to be whatever you believe is best that determines whether examples belong in the positive or negative class.
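
As a rough sanity check (assuming y here is the 0/1 label vector of your training set), you could also measure accuracy on the normalized training data:

pred_train = sigmoid(xnew*theta) >= 0.5;   % predicted 0/1 labels on the training set
train_acc = mean(pred_train == y);         % fraction of correctly classified examples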

The learning rate is too large

As you mentioned in the comments, once you normalize the data the costs appear to be finite, but then they suddenly go to NaN after a few iterations. Normalization can only get you so far. If your learning rate alpha is too large, each iteration will overshoot in the direction of the minimum, and thus the cost at each iteration will oscillate or even diverge, which appears to be what is happening here. In your case, the cost is diverging or increasing at each iteration to the point where it is so large that it can't be represented using floating point precision.

As such, one other option is to decrease your learning rate alpha until you see that the cost function is decreasing at each iteration. A popular method to determine the best learning rate is to perform gradient descent on a range of logarithmically spaced values of alpha, check the final cost function value for each, and choose the learning rate that results in the smallest cost.
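
A rough sketch of such a sweep, reusing your batchGD from the question on the normalized data (the particular range of alpha values here is only an illustration and may need adjusting):

alphas = logspace(-5, 0, 10);                  % candidate learning rates from 1e-5 to 1
final_costs = zeros(numel(alphas), 1);
for k = 1:numel(alphas)
    theta0 = zeros(size(xnew,2), 1);           % start every run from the same initial parameters
    [~, ch] = batchGD(xnew, y, theta0, alphas(k));
    final_costs(k) = ch(end);                  % diverging runs end up as Inf/NaN and won't be picked
end
[~, best] = min(final_costs);                  % min ignores NaN entries
best_alpha = alphas(best);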


Using the two facts above together should allow gradient descent to converge quite nicely, assuming that the cost function is convex. In this case for logistic regression, it most certainly is.
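
For completeness, a minimal end-to-end sketch (assuming xnew from the normalization step above and 0.01 as a placeholder learning rate that you will likely still need to tune):

theta0 = zeros(size(xnew,2), 1);                          % initial parameters, including the bias term
[theta, cost_history] = batchGD(xnew, y, theta0, 0.01);   % 0.01 is only a starting guess
plot(cost_history), xlabel('iteration'), ylabel('cost')   % the cost should decrease steadily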

rayryeng
  • Yes I figured that out. Thank you so much. – Neel Shah Feb 16 '16 at 05:02
  • I am getting a few values properly, but most of the values are still NaN. Any way to overcome this too? – Neel Shah Feb 16 '16 at 05:56
  • Yes if that is happening, one way is to enforce a cap on large negative and positive values. In your cost function file before you compute the sum, you can do something like `htheta(htheta >= 100) = 100; htheta(htheta <= -100) = -100;` This will ensure that when you apply the `log` to your hypothesis vector, you will get floating-point friendly results. If you get a hypothesis that is larger than 100 or smaller than -100, then we can safely assume that we can classify the input into the 1 or 0 class respectively and so placing this cap on your results should be OK. – rayryeng Feb 16 '16 at 06:04
  • I am still not able to get proper accuracy. This is the dataset which I am working on: https://archive.ics.uci.edu/ml/datasets/Spambase – Neel Shah Feb 16 '16 at 06:39
  • I imported the data into Matlab and I checked if there are any missing values. There are no missing values. I am implementing logistic regression for my assignment. – Neel Shah Feb 16 '16 at 07:15
  • I did normalize the data. What other processing methods should I apply? – Neel Shah Feb 16 '16 at 07:54
  • @NeelShah This behaviour of a "few values being proper", do you notice that the cost function is **increasing** or **decreasing**? Specifically, at each iteration do you notice the cost function increasing in value or decreasing? If it's increasing, your learning rate is too large. Decrease it until you see that the cost function is decreasing at each step. – rayryeng Feb 16 '16 at 08:06
  • Got it. Getting better accuracy. Thanks for all the help. – Neel Shah Feb 16 '16 at 08:52
  • @NeelShah my pleasure. Good luck on your project! Also thank you for the link to the spam email data. I'm currently teaching machine learning this quarter and I'll be using that data for the next lab assignment for my students. – rayryeng Feb 16 '16 at 08:54
  • log(0) is not NaN. log(0) is -Inf which should become +Inf when negated. This should not generate a problem unless I'm missing something. – Matthew Gunn Feb 16 '16 at 17:01
  • @MatthewGunn That observation was made after a series of edits. I found out later that it was the learning rate, but it's still good practice to normalize the data to ensure convergence via gradient descent. I'll edit my post. BTW, if the OP used a more efficient algorithm (i.e. BFGS, Conjugate Gradient, etc.), there would be no need to normalize the data. This is particular to gradient descent only. – rayryeng Feb 16 '16 at 17:04
  • 2
    @MatthewGunn Figured out why the `NaN`s were happening. `y` can be 0 or 1 with this problem, and doing `y*log(x*theta)` where `x*theta` can be close to 0 would thus make `0*log(0)` and thus produce `NaN`. – rayryeng Feb 16 '16 at 17:11
  • @rayryeng He needs to rewrite his cost function then so that doesn't happen. – Matthew Gunn Feb 16 '16 at 17:19
  • @MatthewGunn for sure. It will most likely change the values from `NaN` to `Inf`, and so you're stuck with the same problem. Normalization of the data is required so that the costs don't wildly diverge... and the learning rate plays a factor too. BTW, I've edited my post. Thank you for your comments! – rayryeng Feb 16 '16 at 17:20
  • In my experience with logistic regression, it's quite easy for `htheta` to hit 0 or 1 and code should be robust to that. – Matthew Gunn Feb 16 '16 at 17:23
  • @MatthewGunn for sure. – rayryeng Feb 16 '16 at 17:26
  • The first point is commonly known as **Feature Scaling**. It's a very good topic to read about. – Vipin Chaudhary Mar 20 '17 at 17:48
  • @VipinChaudary yes it is! – rayryeng Mar 20 '17 at 17:59
6

Let's assume you have an observation where:

  • the true value is y_i = 1
  • your model is quite extreme and says that P(y_i = 1) = 1

Then your cost function will get a value of NaN because you're adding 0 * log(0), which is undefined. Hence:

Your formula for the cost function has a problem (there is a subtle 0, infinity issue)!

As @rayryeng pointed out, 0 * log(0) produces a NaN because 0 * Inf isn't kosher. This is actually a huge problem: if your algorithm believes it can predict a value perfectly, it incorrectly assigns a cost of NaN.

Instead of:

cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));

You can avoid multiplying 0 by infinity by instead writing your cost function in Matlab as:

y_logical = y == 1;
cost = sum(-log(htheta(y_logical))) + sum(-log(1 - htheta(~y_logical)));

The idea is that if y_i is 1, we add -log(htheta_i) to the cost, but if y_i is 0, we add -log(1 - htheta_i) to the cost. This is mathematically equivalent to -y_i * log(htheta_i) - (1 - y_i) * log(1 - htheta_i), but without running into the numerical problems that essentially stem from htheta_i being equal to 0 or 1 within the limits of double precision floating point.
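
Plugged back into the computeCost function from the question, a sketch of this more robust version (splitting the sum per class, as also suggested in the comments below) would be:

function cost = computeCost(x, y, theta)
    htheta = sigmoid(x*theta);
    % sum the two classes separately so that 0*log(0) is never formed
    cost = sum(-log(htheta(y == 1))) + sum(-log(1 - htheta(y == 0)));
end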

Matthew Gunn
  • Can you elaborate? I did not understand how this will avoid NaN or Inf case. Thanks. – Neel Shah Feb 16 '16 at 19:18
  • @NeelShah The reason why is because this explicitly avoids the multiplication of `0*log(0)` if the situation were to arise. By indexing into your hypothesis and selecting out those values that belong to each class respectively, this avoids having any `NaN` values that may result in the sum computation. Your true reason why you are getting `NaN` is because your learning rate is too large, but what Matthew has suggested is great for making a more robust cost function. – rayryeng Feb 16 '16 at 19:19
  • 1
    BTW Matthew, you may get a dimension mismatch because indexing using `y_logical` and `~y_logical` may produce different sized vectors. I would suggest splitting up the `sum` into two separate operations... those where `y == 1` and those where `y == 0` then adding the two results together. – rayryeng Feb 16 '16 at 19:22
  • Ok understood. How can I implement this in Matlab? I am not able to figure out that. – Neel Shah Feb 16 '16 at 19:23
  • 1
    @NeelShah Something like: `cost = sum(-log(htheta(y == 1))) + sum(-log(1 - htheta(y == 0)));` should do nicely. – rayryeng Feb 16 '16 at 19:56
  • 1
    @rayryeng oops! you're right. should be corrected now. – Matthew Gunn Feb 16 '16 at 21:57
  • I don't understand - what is htheta(y_logical)? From the code posted above, htheta is not a function, it is a vector. – codewarrior Oct 08 '16 at 01:09
  • 1
    @codewarrior In Matlab, let's say you have a vector `x = [1, 2, 3, 4, 5, 6]';` you could do `y = x([1,0,1,1,0,1]')` and then `y` would be equal to `[1, 3, 4, 6]`. It's kind of like a .selectSubsetBasedOnLogicalMask function. Go to logical indexing on this page: https://www.mathworks.com/company/newsletters/articles/matrix-indexing-in-matlab.html – Matthew Gunn Oct 08 '16 at 01:35
  • @codewarrior the idea is that for each observation i, you want to add log(htheta_i) when y_i is 1 and you want to add log(1 - htheta_i) when y_i is 0. The formula the OP uses doesn't *quite* do that because if y_i = 0 and htheta_i = 0, then 0 * log(0) creates an error. It adds NaN to the cost function instead of adding 0. – Matthew Gunn Oct 08 '16 at 01:43
  • @MatthewGunn: Thank you! – codewarrior Oct 08 '16 at 19:23
  • this doesn't save us from the 0*inf problem, because if theta = 1, then you can easily get log(0) – Stepan Yakovenko Apr 29 '19 at 19:28
  • @StepanYakovenko You cannot get 0*inf problems: my code doesn't even call the multiply command! If theta_i = 1 and y_i = 0, then you do get an infinite cost, but that's correct. Your model says y_i should be 1 with 100% probability, but y_i is 0. – Matthew Gunn Apr 29 '19 at 22:33
2

It happened to me because of an indeterminate form of the type:

0*log(0)

This can happen when one of the predicted values Y equals either 0 or 1. In my case the solution was to add an if statement to the Python code as follows:

y * np.log(Y) + (1 - y) * np.log(1 - Y) if Y != 1 and Y != 0 else 0

This way, when the actual value (y) and the predicted one (Y) are equal, no cost needs to be computed, which is the expected behavior.

(Notice that when a given Y is converging to 0 the left addend is canceled (because of y=0) and the right addend tends toward 0. The same happens when Y converges to 1, but with the opposite addend.)

(There is also a very rare scenario, which you probably won't need to worry about, where y=0 and Y=1 or vice versa, but if your dataset is standardized and the weights are properly initialized it won't be an issue.)