
I've actually been struggling with this for about two months now. What is it that makes these different?

hypotheses= X * theta
temp=(hypotheses-y)'
temp=X(:,1) * temp
temp=temp * (1 / m)
temp=temp * alpha
theta(1)=theta(1)-temp

hypotheses= X * theta
temp=(hypotheses-y)'
temp=temp * (1 / m)
temp=temp * alpha
theta(2)=theta(2)-temp



theta(1) = theta(1) - alpha * (1/m) * ((X * theta) - y)' * X(:, 1);
theta(2) = theta(2) - alpha * (1/m) * ((X * theta) - y)' * X(:, 2);

The latter works. I'm just not sure why. I struggle to understand the need for the matrix transpose.

narthur157
  • I don't think this is a proper implementation of gradient descent. You need to update both your thetas at the same time to be accurate: `tmpTheta1 = theta(1) - alpha * (1/m) * ((X * theta) - y)' * X(:, 1); tmpTheta2 = theta(2) - alpha * (1/m) * ((X * theta) - y)' * X(:, 2); theta(1) = tmpTheta1; theta(2) = tmpTheta2;` – Einar Sundgren Apr 29 '13 at 08:42

7 Answers


In your first example, in the second block, you've missed out a step, haven't you? I am assuming you concatenated X with a vector of ones.

   temp=X(:,2) * temp

The last example will work, but it can be vectorized even further to be simpler and more efficient.

I've assumed you only have one feature. It will work the same with multiple features, since all that happens is you add an extra column to your X matrix for each feature. Basically you add a vector of ones to x to vectorize the intercept.

You can update a 2x1 matrix of thetas in one line of code. Concatenate a vector of ones onto x, making it an nx2 matrix; you can then calculate h(x) by multiplying by the theta vector (2x1), which is the (X * theta) part.

The second part of the vectorization is to transpose ((X * theta) - y), which gives you a 1xn matrix. Multiplying that by X (an nx2 matrix) basically aggregates both (h(x)-y)x0 and (h(x)-y)x1, so by definition both thetas are updated at the same time. This results in a 1x2 matrix of new thetas, which I just transpose again to flip the vector around into the same dimensions as the theta vector. I can then do a simple scalar multiplication by alpha and a vector subtraction with theta.

y = data(:, 2);
m = length(y);
X = [ones(m, 1), data(:, 1)];  % concatenate a column of ones for the intercept
theta = zeros(2, 1);

iterations = 2000;
alpha = 0.001;

for iter = 1:iterations
     theta = theta - ((1/m) * ((X * theta) - y)' * X)' * alpha;
end
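Since this thread is in Octave, here is the same vectorized update sketched in plain Python (no libraries) so the bookkeeping is explicit. The dataset, alpha, and iteration count below are made up for illustration; with y = 1 + 2x exactly, the loop drives theta toward [1, 2].

```python
def gradient_descent(xs, ys, alpha, iterations):
    m = len(ys)
    theta = [0.0, 0.0]                     # theta is a 2x1 vector
    for _ in range(iterations):
        # h(x_i) = theta0 * 1 + theta1 * x_i  -- the X * theta step
        errors = [theta[0] + theta[1] * x - y for x, y in zip(xs, ys)]
        # ((X*theta - y)' * X)' done column by column:
        # the ones column and the feature column of X
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # both components computed from the same errors => simultaneous update
        theta = [theta[0] - alpha * grad0, theta[1] - alpha * grad1]
    return theta

theta = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0],
                         alpha=0.1, iterations=2000)
```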
KLDavenport
Shaun Ryan
  • Why do you need to transpose (1/m) * ((X * theta) - y)' * X in your for loop? – Graham Slick Mar 27 '16 at 14:22
  • Same question as Graham: why is that entire subexpression transposed? – qbert65536 Apr 28 '16 at 22:37
    The result of `((1/m) * ((X * theta) - y)' * X)` is 1x2. `theta` is 2x1. So the bit between brackets needs to be transposed to have the same dimensions and subtract it from `theta`. – AronVanAmmers May 10 '16 at 22:23
    Same question as above. It should be theta = theta - (alpha/m) * X' * (X * theta - y) – Weihui Guo Nov 27 '16 at 00:49
  • I think this is related to the rules of matrix calculation: A*B is the product of A's rows with B's columns. I mean, conceptually it's okay to use the general formulas for gradient descent, but when you're playing in the matrix fields you have to adapt to their rules (multiplication of rows by columns, commutativity restrictions, etc.). This is just my guess, maybe I am wrong. – Telmo May 16 '17 at 21:57
  • As stated: "This results in a 1x2 matrix of new thetas, which I just transpose again to flip the vector around into the same dimensions as the theta vector. I can then do a simple scalar multiplication by alpha and a vector subtraction with theta." In order to calculate using vectors, the dimensionality has to be correct; conforming the dimensions significantly simplifies the final calculation. When putting this together (as with any code) I simplified it right down to its most minimal form, given the time. – Shaun Ryan Jun 23 '20 at 06:54
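The equivalence the comments are circling can be checked numerically: ((1/m) * ((X * theta) - y)' * X)' and (1/m) * X' * ((X * theta) - y) are the same 2-vector, since (A*B)' = B'*A'. A small check in plain Python with made-up numbers:

```python
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]   # nx2 design matrix
theta = [0.3, -0.2]
y = [1.0, 2.0, 3.0]
m = len(y)

# (X * theta) - y, an nx1 vector of errors
err = [sum(Xij * tj for Xij, tj in zip(row, theta)) - yi
       for row, yi in zip(X, y)]

# form 1: (err' * X)' / m -- row vector times X, then transposed
form1 = [sum(e * row[j] for e, row in zip(err, X)) / m for j in range(2)]
# form 2: X' * err / m -- each column of X dotted with err
form2 = [sum(X[i][j] * err[i] for i in range(m)) / m for j in range(2)]
```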
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
% Performs gradient descent to learn theta. Updates theta by taking num_iters 
% gradient steps with learning rate alpha.

% Number of training examples
m = length(y); 
% Save the cost J in every iteration in order to plot J vs. num_iters and check for convergence 
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    h = X * theta;
    stderr = h - y;
    theta = theta - (alpha/m) * (stderr' * X)';
    J_history(iter) = computeCost(X, y, theta);
end

end
skeller88

In the first one, if X were a 3x2 matrix and theta were a 2x1 matrix, then "hypotheses" would be a 3x1 matrix.

Assuming y is a 3x1 matrix, you can perform (hypotheses - y) and get a 3x1 matrix; the transpose of that 3x1 matrix is the 1x3 matrix assigned to temp.

Then that 1x3 matrix is assigned to theta(2), but theta(2) should be a scalar, not a matrix.

The last two lines of your code work because, using my mxn examples above,

(X * theta)

would be a 3x1 matrix.

Then that 3x1 matrix is subtracted by y (a 3x1 matrix) and the result is a 3x1 matrix.

(X * theta) - y

So the transpose of the 3x1 matrix is a 1x3 matrix.

((X * theta) - y)'

Finally, a 1x3 matrix times a 3x1 matrix equals a scalar (a 1x1 matrix), which is what you are looking for. I'm sure you knew this already, but just to be thorough: X(:,2) is the second column of the 3x2 matrix, making it a 3x1 matrix.
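The shape bookkeeping above can be traced in code. This is a plain-Python sketch with a hypothetical 3x2 X, 2x1 theta, and 3x1 y; the matmul/transpose helpers are illustrative, not from the question:

```python
def matmul(A, B):
    # naive matrix product: (p x q) times (q x r) -> (p x r)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

X = [[1, 4], [1, 5], [1, 6]]          # 3x2
theta = [[0.1], [0.2]]                # 2x1
y = [[1], [2], [3]]                   # 3x1

hypotheses = matmul(X, theta)         # 3x2 times 2x1 -> 3x1
diff = [[h[0] - yi[0]] for h, yi in zip(hypotheses, y)]  # 3x1
diff_t = transpose(diff)              # 1x3
col2 = [[row[1]] for row in X]        # X(:,2) as a 3x1 matrix
result = matmul(diff_t, col2)         # 1x3 times 3x1 -> 1x1 scalar
```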

Justin Nafe

When you update, you need to do it like this:

Start Loop {

    temp0 = theta0 - (equation_here);
    temp1 = theta1 - (equation_here);

    theta0 = temp0;
    theta1 = temp1;

} End loop
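One gradient step on a toy problem shows why the temporaries matter; the sketch is in plain Python and the data and alpha are made up. The sequential version computes theta1's gradient with the already-updated theta0, so the two versions land on different thetas:

```python
xs, ys = [1.0, 2.0], [2.0, 3.0]
m, alpha = len(ys), 0.5

def grads(t0, t1):
    # gradient of the usual squared-error cost at (t0, t1)
    err = [t0 + t1 * x - y for x, y in zip(xs, ys)]
    return sum(err) / m, sum(e * x for e, x in zip(err, xs)) / m

# simultaneous: both gradients evaluated at the same (t0, t1)
g0, g1 = grads(0.0, 0.0)
sim = (0.0 - alpha * g0, 0.0 - alpha * g1)

# sequential: theta1's gradient already sees the updated theta0
g0, _ = grads(0.0, 0.0)
t0_new = 0.0 - alpha * g0
_, g1 = grads(t0_new, 0.0)
seq = (t0_new, 0.0 - alpha * g1)
```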
sody
hbr

This can be vectorized more simply with

h = X * theta    % m-dimensional vector (the prediction our hypothesis gives per training example)
std_err = h - y  % m-dimensional vector of errors (one per training example)
theta = theta - (alpha/m) * X' * std_err

Remember, X is the design matrix, and as such each row of X represents a training example and each column of X represents a given component (say the zeroth or first component) across all training examples. Each column of X is therefore exactly the thing we want to multiply element-wise with std_err before summing to get the corresponding component of the theta vector.
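That column-wise reading can be spelled out in a few lines of plain Python with made-up numbers: each component of X' * std_err is one column of the design matrix multiplied element-wise with the errors and summed.

```python
X = [[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]]  # design matrix with a ones column
std_err = [0.5, -1.0, 2.0]                # h - y, one entry per example

cols = list(zip(*X))                      # the columns of X
# component j of X' * std_err = sum_i X[i][j] * std_err[i]
grad = [sum(c * e for c, e in zip(col, std_err)) for col in cols]
```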

fpghost
  • This all seems well and good, but why are we allowed to transpose X? Will that not change the value? Many here just explain it by saying we have to do it to make the matrices conform. But why? Is X' the derivative? – Tuxedo Joe Feb 11 '20 at 19:19
  • `X * theta`: is it the dot product or just an element-wise operation? – Garde Des Ombres Aug 27 '20 at 13:45
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1 : num_iters
    hypothesis = X * theta;
    Error = (hypothesis - y);
    temp = theta - ((alpha / m) * (Error' * X)');
    theta = temp;
    J_history(iter) = computeCost(X, y, theta);
end
end
Sardar Usama
Spoiler alert
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

% ====================== YOUR CODE HERE ======================
% Instructions: Perform a single gradient step on the parameter vector
%               theta. 
%
% Hint: While debugging, it can be useful to print out the values
%       of the cost function (computeCost) and gradient here.
% ========================== BEGIN ===========================


J = computeCost(X, y, theta);
t = theta - ((alpha * ((theta' * X') - y')) * X / m)';
theta = t;
J1 = computeCost(X, y, theta);

if (J1 > J)
    fprintf('Wrong alpha');
    break;
elseif (J1 == J)
    break;
end


% ========================== END ==============================

% Save the cost J in every iteration    
J_history(iter) = computeCost(X, y, theta);
end
end
boatcoder
user2696258