
I'm learning neural networks (linear regression) in MATLAB for my research project, and this is part of the code I use. The problem is that the value of "theta" is NaN and I don't know why. Could you tell me where the error is?

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
theta = zeros(2, 1); % initialize fitting parameters
%GRADIENTDESCENT Performs gradient descent to learn theta
% theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by 
% taking num_iters gradient steps with learning rate alpha
% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta - ((alpha/m)*((X*theta)-y)' * X)';
end
end
% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations); 
Mohamed Nedal
    Can you provide the values of the input arguments that you give to the function? – codeaviator Jan 24 '17 at 15:12
  • @Cebri alpha = 0.01 and num_iters = 1500 and both of (X, y) are column vectors of 133×1 – Mohamed Nedal Jan 24 '17 at 15:22
  • Given those input values, the line in the for loop shouldn't work. When you attempt to perform matrix multiplication between `X` (`size(X) = 133x1`) and `theta` (`size(theta) = 2x1`) you should get an `Inner matrix dimensions must agree` error. Also, why do you pass in the value of `theta` just to define it at a matrix of zeros? – Vladislav Martin Jan 24 '17 at 15:31
  • @VladislavMartin I tried to play with the transpose of X and y and it gives me no error, but "theta" still has a NaN value. Could you propose a solution please? Theta is the parameter (weight) in the cost function and its initial value is zero. It's adjusted with each iteration to minimize the error. This isn't my code; I got it from an online course and I'm trying to modify it according to my application. – Mohamed Nedal Jan 24 '17 at 16:04
  • OK, so what are the transposes? – Mehdi Jan 24 '17 at 16:14
  • @MimSaad (`size(theta) = 2×1`) and (`size(X) = 133×1 = size(y)`). In order to multiply X and y (for example), the number of columns of X must equal the number of rows of y. I tried this form for the "theta" line: `theta = theta - alpha * (1/m) * (X' * (X * theta - y));` but it also gives me NaN for theta. – Mohamed Nedal Jan 24 '17 at 16:29

1 Answer


The function you have is fine. But the sizes of X and theta are incompatible. In general, if size(X) is [N, M], then size(theta) should be [M, 1].

So I would suggest replacing the line

theta = zeros(2, 1); 

with

theta = zeros(size(X, 2), 1);

Alternatively, if you want to keep theta as a 2×1 vector, then X should have as many columns as theta has elements. So in this example, size(X) should be [133, 2] (e.g., with a column of ones added for the intercept term).

Also, you should move that initialization before you call the function.

For example, the following code does not return NaN if you remove the initialization of theta from the function.

X = rand(133, 1); % or rand(133, 2)
y = rand(133, 1);
theta = zeros(size(X, 2), 1); % initialize fitting parameters

% run gradient descent
theta = gradientDescent(X, y, theta, 0.1, 1500) 
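
For reference, this is the function with the in-function initialization removed. The update line is algebraically the same as the original, just written with the transpose moved onto X:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
m = length(y);                   % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    % equivalent to the original ((alpha/m)*((X*theta)-y)' * X)'
    theta = theta - (alpha/m) * (X' * (X*theta - y));
end
end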

EDIT: This is in response to comments below.

Your problem is due to the gradient descent algorithm not converging. To see it yourself, plot J_history, which should never increase if the algorithm is stable. You can compute J_history by inserting the following line inside the for-loop in the function gradientDescent:

J_history(iter) = mean((X * theta - y).^2); % mean squared error after this iteration
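
To plot it on a logarithmic y-axis (as in the figure below):

semilogy(J_history);                 % log-scale y-axis makes exponential growth obvious
xlabel('iteration'); ylabel('cost');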

In your case (i.e. given the data file and alpha = 0.01), J_history increases exponentially. This is shown in the plot below. Note that the y-axis is on a logarithmic scale.

[Figure: J_history growing exponentially with iterations (log-scale y-axis)]

This is a clear sign of instability in gradient descent.

There are two ways to eliminate this problem.

Option 1. Use smaller alpha. alpha controls the rate of gradient descent. If it is too large, the algorithm is unstable. If it is too small, the algorithm takes a long time to reach the optimal solution. Try something like alpha = 1e-8 and go from there. For example, alpha = 1e-8 results in the following cost function:

[Figure: cost function history with alpha = 1e-8]
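
One practical way to settle on alpha is to sweep a few candidate values and keep the largest one for which the cost never increases. A rough sketch, assuming J_history is recorded as above (the candidate list is just an illustration):

candidates = [1e-10, 1e-9, 1e-8, 1e-7, 1e-6];
best_alpha = NaN;
for a = candidates                    % candidates in increasing order
    [~, J] = gradientDescent(X, y, zeros(size(X, 2), 1), a, 1500);
    if all(diff(J) <= 0)              % cost never increased: stable run
        best_alpha = a;               % keep the largest stable alpha
    end
end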

Option 2. Use feature scaling to reduce the magnitude of the inputs. One way of doing this is called standardization. The following is an example of applying standardization and the resulting cost function:

data = xlsread('v & t.xlsx');                             % load the data file
data(:,1) = (data(:,1)-mean(data(:,1)))/std(data(:,1));  % standardize: zero mean, unit variance

[Figure: cost function history after standardization]
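
As a side note, if the Statistics and Machine Learning Toolbox is available, zscore does the same standardization in one call:

data(:,1) = zscore(data(:,1));   % subtract the mean, divide by the standard deviation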

aksadv
  • I tried your code and it does not return `NaN`, but only if I add this line before "theta": `X = [ones(m,1), rand(133,1)]; % Add a column of ones to x`. However, when I replace X and y with my values from the Excel sheet it does not work and "theta" returns `NaN`: `X = data(:,6); y = data(:,1); X = [ones(m,1), data(:,6)]; theta = zeros(size(X, 2), 1); theta = gradientDescent(X, y, theta, 0.01, 1500);` Could you please tell me where the error is? – Mohamed Nedal Jan 25 '17 at 08:23
  • Did you make sure to remove the line `theta = zeros(2, 1);` from the function `gradientDescent`? Also make sure that `X` and `y` only contain valid finite numbers. As a test, what do `var(X)` and `var(y)` return? If you require further help, you could upload the data file to Google Drive and share publicly for us to examine. – aksadv Jan 25 '17 at 18:49
  • Yes, I replaced this line with yours: `theta = zeros(size(x, 2), 1);` `var(x) = 0 312.1246` and `var(y) = 3.7478e+05` I was using the line `theta = zeros(2, 1);` because I was using this line `x = [ones(m, 1), data(:, 2)]; % To add a column of ones to x` This is my data file, the 1st column is the velocity (km/s) and the 2nd one is the time (hours): https://drive.google.com/file/d/0Bw5Fgx5h69chMlMzVlNpbEd6YzA/view?usp=sharing I really do appreciate your help. – Mohamed Nedal Jan 26 '17 at 11:55
  • I tried changing the values of `alpha` and `iterations` and found that `theta` stopped returning `NaN`, but the resulting linear regression was wrong. – Mohamed Nedal Jan 26 '17 at 11:56
  • The problem is due to the gradient descent algorithm diverging for your choice of data and `alpha`. See a more detailed response in the answer above. – aksadv Jan 26 '17 at 23:49
  • Thank you very much, I will try again and check the updates. – Mohamed Nedal Jan 27 '17 at 04:09