I am working on linear regression with two-dimensional data, but I cannot get the correct weights for the regression line. There seems to be a problem with the following code, because the weights it computes are not correct. Using large data values (around 80000 for x) results in NaN for the weights. Scaling the data from 0 to 1 results in wrong weights, because the regression line does not match the data.

function [w, epoch_batch, error_batch] = batch_gradient_descent(x, y)

% number of examples
q = size(x,1);

% learning rate
alpha = 1e-10;

% random initial weights
w0 = rand(1);
w1 = rand(1);

% current error and convergence tolerance on its change
curr_error = inf;
eps = 1e-7;

% maximum number of epochs
epochs = 1e100;
epoch_batch = 1;
error_batch = inf;
for epoch = 1:epochs
    prev_error = curr_error;
    % current sum-of-squared-errors cost
    curr_error = sum((y - (w1.*x + w0)).^2);
    % gradient-descent updates for intercept w0 and slope w1
    w0 = w0 + alpha/q * sum(y - (w1.*x + w0));
    w1 = w1 + alpha/q * sum((y - (w1.*x + w0)).*x);
    if ((abs(prev_error - curr_error) < eps))
        epoch_batch = epoch;
        error_batch = abs(prev_error - curr_error);
        break;
    end
end

w = [w0, w1];

Could you tell me where I made an error? After hours of trying, it still seems correct to me.

Data:

x
   35680
   42514
   15162
   35298
   29800
   40255
   74532
   37464
   31030
   24843
   36172
   39552
   72545
   75352
   18031

y
    2217
    2761
     990
    2274
    1865
    2606
    4805
    2396
    1993
    1627
    2375
    2560
    4597
    4871
    1119
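
For completeness, the function above would be called on these vectors roughly like this (a minimal usage sketch, assuming x and y are the column vectors listed above):

[w, epoch_batch, error_batch] = batch_gradient_descent(x, y);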

Here is the code to plot the data:

figure(1)
% plot data points
plot(x, y, 'ro');
hold on;
xlabel('x value');
ylabel('y value');
grid on;

% x vector from min to max data point
x = min(x):max(x);
% calculate y with weights from batch gradient descent
y = (w(1) + w(2)*x);
% plot the regression line
plot(x,y,'r');

The weights for the unscaled data set could be found using a smaller learning rate, alpha = 1e-10. However, when scaling the data from 0 to 1, I still have trouble getting matching weights.

scaled_x =

0.4735
0.5642
0.2012
0.4684
0.3955
0.5342
0.9891
0.4972
0.4118
0.3297
0.4800
0.5249
0.9627
1.0000
0.2393

scaled_y_en =

0.0294
0.0366
0.0131
0.0302
0.0248
0.0346
0.0638
0.0318
0.0264
0.0216
0.0315
0.0340
0.0610
0.0646
0.0149
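
For clarity, the scaled values above appear to have been produced by dividing both vectors by max(x) (this is an inference from the numbers, consistent with the comment below about dividing by the maximum value):

% assumed scaling step, inferred from the listed values
x_max = max(x);
scaled_x = x ./ x_max;
scaled_y_en = y ./ x_max;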
  • Can you give some sample data to call the function? – Ander Biguri Mar 07 '16 at 13:26
  • I added the unscaled data which results in NaN values for the weights. Scaling from 0 to 1, by dividing through the max value, returns wrong weights which do not match the data. – evolved Mar 07 '16 at 13:32
  • You are trying to minimize which function exactly? – Ander Biguri Mar 07 '16 at 13:48
  • Most likely the error is in `w1 = w1 + alpha/q * sum((y - (w1.*x + w0)).*x);`, as this line does **not** make `sum(y - (w1.*x + w0))` smaller, thus it is going in the opposite direction of the minimization. – Ander Biguri Mar 07 '16 at 13:54
  • I would like to minimize the cost function `J(w) = sum(yj - h_w(xj))^2` over all samples `(j = 1 to q)`, where `h_w(xj) = w1*xj + w0`. – evolved Mar 07 '16 at 14:00
  • @evolved. I don't think your formula for the gradient is correct. You are missing a factor of two. That by itself might be responsible for your non-convergence, although it shouldn't since you are multiplying by an arbitrary constant anyway. – Mad Physicist Mar 07 '16 at 14:24
  • @MadPhysicist. The 2 is folded into the learning rate according to the Artificial intelligence book (Russell, Norvig). The weight update formulas for w0 and w1 are straight from that book too. Maybe I did something wrong with the sum command in matlab? – evolved Mar 07 '16 at 15:18
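
For reference, writing out the cost and gradient being discussed in these comments (a short derivation; the factor of 2 is folded into alpha, as in Russell & Norvig):

J(w) = \sum_{j=1}^{q} \big( y_j - (w_1 x_j + w_0) \big)^2

\frac{\partial J}{\partial w_0} = -2 \sum_{j=1}^{q} \big( y_j - (w_1 x_j + w_0) \big), \qquad \frac{\partial J}{\partial w_1} = -2 \sum_{j=1}^{q} \big( y_j - (w_1 x_j + w_0) \big)\, x_j

so the updates w_0 \leftarrow w_0 + \tfrac{\alpha}{q} \sum_j \big( y_j - (w_1 x_j + w_0) \big) and w_1 \leftarrow w_1 + \tfrac{\alpha}{q} \sum_j \big( y_j - (w_1 x_j + w_0) \big) x_j absorb the constant 2 into \alpha.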

1 Answer

The problem is with w1: you are effectively giving it far too big an update step. You should not give w0 and w1 the same learning step, because the w1 gradient is additionally multiplied by x while the w0 gradient is not.

If I substitute alpha/q with alpha^4/q in the w1 update (an arbitrary choice), then it converges:

[Plot: the data points with the converged regression line]
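
Concretely, the only change inside the existing loop would be the step used for w1; a minimal sketch (with alpha = 0.001, the value mentioned in the comments below):

% same update rule, but a much smaller effective step for w1
alpha = 0.001;
w0 = w0 + alpha/q   * sum(y - (w1.*x + w0));
w1 = w1 + alpha^4/q * sum((y - (w1.*x + w0)).*x);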

  • Thanks for your help! I changed `w1 = w1 + alpha^4/q * sum((y - (w1.*x + w0)).*x);` using alpha^4, but it does not make any difference. Still NaN for the weights. – evolved Mar 07 '16 at 14:12
  • What about `alpha/q^2` for the `w1` case? – Mad Physicist Mar 07 '16 at 14:12
  • @evolved It works for me.... I just copy-pasted your code, so it must work. Oh, `alpha=0.001`. – Ander Biguri Mar 07 '16 at 14:15
  • Thank you, it worked using alpha=0.001. However, I don't understand why it is necessary to have a different learning step for w1. In the book Artificial Intelligence: A Modern Approach (Russell, Norvig), the alpha is the same for both weights, and they state that convergence is guaranteed as long as we pick alpha small enough. – evolved Mar 07 '16 at 14:29
  • `0.01` is not small enough. – Mad Physicist Mar 07 '16 at 14:30
  • Just tried it using alpha = 1e-10; and it works with the same alpha. – evolved Mar 07 '16 at 14:31
  • @evolved Definitely, `alpha=0.001` works with the given data and `alpha^4/q` in w1. – Ander Biguri Mar 07 '16 at 15:10
  • yes it works with the unscaled data, but not with the scaled data (range from 0 to 1). Shouldn't it work with scaled and unscaled data? – evolved Mar 07 '16 at 15:13
  • @evolved Not necessarily, no. You are giving an arbitrary, user-specified learning rate, and it is dependent on the scale of the data. That's why most of the algorithms use normalized data. – Ander Biguri Mar 07 '16 at 15:14
  • Ok thank you @AnderBiguri. And can you also tell me how to find an appropriate learning rate for the scaled data (see my edited answer) in order to get matching weights? Or is there a general rule to find the "correct" alpha? – evolved Mar 07 '16 at 15:20
  • @evolved Nope! Welcome to the amazing world of optimization. No, there is no universal way of setting that. Usually your function would be `batch_gradient_descent(x,y,alpha)`. There are research papers that try to optimize these weights, but this is still an open problem in mathematics. – Ander Biguri Mar 07 '16 at 15:22
  • @AnderBiguri so I have to try in order to get the matching weights?! – evolved Mar 07 '16 at 15:24
  • @evolved Yeah... Really, there is no magic happening here. Generally, if normalized, a value of `[0.01-0.5]` is good, but you just need to try.... – Ander Biguri Mar 07 '16 at 15:25
  • I would also recommend **normalizing** your data with zero mean and unit variance so that the algorithm can converge faster. However, the weights will be with respect to the normalized data so if you want to perform any predictions, you must take this data and normalize it using the mean and variance from your training data. This post may give more insight: http://stackoverflow.com/questions/35419882/cost-function-in-logistic-regression-gives-nan-as-a-result/35422981#35422981 - However, it is for logistic regression and not linear, but the update rule is almost the same. – rayryeng Mar 07 '16 at 19:56
  • @evolved check what ray has to say, he knows this stuff. – Ander Biguri Mar 07 '16 at 20:22
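
Building on rayryeng's last comment, here is a minimal sketch of zero-mean, unit-variance normalization for this problem (the reuse of batch_gradient_descent, the choice of alpha inside it, and the example new input are illustrative assumptions, not code from the question):

% normalize the training inputs (illustrative helper code)
mu    = mean(x);
sigma = std(x);
x_norm = (x - mu) / sigma;

% run gradient descent on the normalized inputs
% (a suitable alpha still has to be chosen inside the function)
w = batch_gradient_descent(x_norm, y);

% to predict for new data, apply the SAME mu and sigma first
x_new      = 50000;                      % example unseen input (made up)
x_new_norm = (x_new - mu) / sigma;
y_pred     = w(1) + w(2) * x_new_norm;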