
I am trying to implement batch gradient descent on a data set with a single feature and multiple training examples (m).

The normal equation gives me the right answer, but the MATLAB code below, which performs batch gradient descent, gives the wrong one.

function [theta] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    delta = zeros(2,1);
    for iter = 1:1:iterations
        for i = 1:1:m
            delta(1,1) = delta(1,1) + (X(i,:)*theta - y(i,1));
            delta(2,1) = delta(2,1) + ((X(i,:)*theta - y(i,1))*X(i,2));
        end
        theta = theta - (delta*(alpha/m));
        computeCost(X,y,theta)
    end
end

y is the vector of target values, and X is a matrix whose first column is all ones and whose second column holds the feature values.

I have implemented the update step using vectorization, i.e.

theta = theta - (alpha/m)*delta

... where delta is a 2-element column vector initialized to zeros.

The cost function is J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2.
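The computeCost function itself is not shown; a minimal sketch of what it presumably looks like, given that definition:

    function J = computeCost(X, y, theta)
        %// Minimal sketch, assuming the standard squared-error cost above;
        %// the actual computeCost from the assignment is not shown here.
        m = length(y);
        J = (1/(2*m)) * sum((X*theta - y).^2);
    end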

  • consider to change tag [tag:batch-file] to [tag:batch-processing]... – aschipfl Aug 28 '15 at 15:27
  • 1
    @aschipfl - Good suggestion. However, this post has nothing to do with batch processing or anything batch related in terms of files or data. In this context, it's equated with a technique in machine learning so I've removed the tag as it isn't suitable. Probably a mistag on the OP's part. Thanks for the comment though! – rayryeng Aug 28 '15 at 16:25
  • 1
    oops, I was too quick typing and didn't read read the Q carefully enough... thanks for correcting! – aschipfl Aug 28 '15 at 16:33

1 Answer


The error is very simple: your delta declaration should be inside the first for loop. Every time you accumulate the weighted differences between the training samples and the outputs, you should start accumulating from zero.

By not doing this, you carry over the errors accumulated in previous iterations, which takes the error of the previously learned versions of theta into account, and that isn't correct. You must put the declaration at the beginning of the first for loop.

In addition, your computeCost call is never assigned to anything, so its result is simply printed to the console. I'm assuming it calculates the cost function at every iteration given the current parameters, so I'm going to create a new output array called costs that records the cost at each iteration, and assign the function's result to the corresponding element of this array:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1); %// New - one cost per iteration, not per sample
%   delta = zeros(2,1); %// Remove
    for iter = 1:1:iterations
        delta = zeros(2,1); %// Place here - reset the accumulator every iteration
        for i = 1:1:m
            delta(1,1) = delta(1,1) + (X(i,:)*theta - y(i,1));
            delta(2,1) = delta(2,1) + ((X(i,:)*theta - y(i,1))*X(i,2));
        end
        theta = theta - (delta*(alpha/m));
        costs(iter) = computeCost(X,y,theta); %// New
    end
end
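For completeness, a hypothetical call might look like this (the data, learning rate, and iteration count below are made up purely for illustration):

    x = (1:10)';                      %// single feature (made-up data)
    y = 2 + 3*x + 0.1*randn(10, 1);   %// noisy linear targets
    X = [ones(10, 1), x];             %// prepend the column of ones
    theta = zeros(2, 1);              %// start from a vector of zeros
    [theta, costs] = gradientDescent(X, y, theta, 0.01, 1500);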

A note on proper vectorization

FWIW, I don't consider this implementation completely vectorized. You can eliminate the second for loop by using vectorized operations. Before we do that, let me cover some theory so we're on the same page. You are using gradient descent here in the context of linear regression. We want to find the best parameters theta, our linear regression coefficients, that minimize this cost function:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2

m corresponds to the number of training samples we have available, and x^{(i)} corresponds to the ith training example. y^{(i)} corresponds to the ground truth value we have associated with the ith training sample. h is our hypothesis, and it is given as:

h_\theta\left(x^{(i)}\right) = \theta_0 + \theta_1 x^{(i)}

Note that in the context of linear regression in 2D, we only have two values in theta we want to compute - the intercept term and the slope.

We can minimize the cost function J to determine the best regression coefficients, the ones that give us the best predictions and minimize the error over the training set. Specifically, starting with some initial theta parameters (usually a vector of zeros), we iterate from 1 up to as many iterations as we see fit, and at each iteration we update our theta parameters by this relationship:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

For each parameter we want to update, we need to determine the gradient of the cost function with respect to that variable and evaluate it at the current state of theta. If you work this out using calculus, we get:

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}

If you're unclear with how this derivation happened, then I refer you to this nice Mathematics Stack Exchange post that talks about it:

https://math.stackexchange.com/questions/70728/partial-derivative-in-gradient-descent-for-two-variables
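In short, it's an application of the chain rule; here is a condensed version (my own summary of that derivation, with x_0^{(i)} = 1 standing in for the intercept term):

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) \frac{\partial h_\theta\left(x^{(i)}\right)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}

The last step follows because the hypothesis is linear in theta, so its partial derivative with respect to \theta_j is simply x_j^{(i)}.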

Now... how can we apply this to our current problem? Specifically, you can calculate the entries of delta quite easily by analyzing all of the samples together in one go. What I mean is that you can just do this:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1); %// One cost per iteration
    for iter = 1 : iterations
        %// Compute both updates from the current theta before overwriting it
        delta1 = theta(1) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,1)));
        delta2 = theta(2) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,2)));

        theta = [delta1; delta2];
        costs(iter) = computeCost(X,y,theta);
    end
end

The operations on delta(1) and delta(2) can be completely vectorized into a single statement each. What you are computing is theta^{T}*x^{(i)} for each sample i from 1, 2, ..., m, and you can conveniently place that computation into a single sum statement.
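As an aside (this isn't in the original answer), both gradient components can even be formed in one statement; this sketch uses bsxfun so it also runs on MATLAB releases that predate implicit expansion (R2016b):

    grad  = (1/m) * sum(bsxfun(@times, X*theta - y, X), 1)'; %// 2 x 1 gradient here
    theta = theta - alpha * grad;                            %// same update as before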

We can go even further and replace this with purely matrix operations. First off, we can compute theta^{T}*x^{(i)} for each input sample x^{(i)} very quickly using matrix multiplication. Suppose:

X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}

Here, X is our data matrix: it has m rows, one per training sample, and n+1 columns, a first column of ones for the intercept term followed by n feature columns. Similarly, theta is the weight vector learned by gradient descent, with n+1 entries, one per column of X.

If we compute X*theta, we get:

X\theta = \begin{bmatrix} \theta_0 + \theta_1 x_1^{(1)} + \cdots + \theta_n x_n^{(1)} \\ \theta_0 + \theta_1 x_1^{(2)} + \cdots + \theta_n x_n^{(2)} \\ \vdots \\ \theta_0 + \theta_1 x_1^{(m)} + \cdots + \theta_n x_n^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta\left(x^{(1)}\right) \\ h_\theta\left(x^{(2)}\right) \\ \vdots \\ h_\theta\left(x^{(m)}\right) \end{bmatrix}

As you can see here, we have computed the hypothesis for each sample and have placed each into a vector. Each element of this vector is the hypothesis for the ith training sample. Now, recall what the gradient term of each parameter is in gradient descent:

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}

We want to implement this all in one go for all of the parameters in your learned vector, and so putting this into a vector gives us:

\nabla_\theta J(\theta) = \frac{1}{m} \begin{bmatrix} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) \\ \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_n^{(i)} \end{bmatrix}

Finally:

\nabla_\theta J(\theta) = \frac{1}{m} X^{\mathsf{T}} \left( X\theta - y \right)
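To see why the matrix form matches the summed gradients, write out the jth component (my own expansion, added for readers who ask about this in the comments below):

\left[ X^{\mathsf{T}} \left( X\theta - y \right) \right]_j = \sum_{i=1}^{m} x_j^{(i)} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)

which is exactly m times the gradient term for \theta_j.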

Therefore, we know that y is already a vector of length m, and so we can very compactly compute gradient descent at each iteration by:

theta = theta - (alpha/m)*X'*(X*theta - y);

.... so your code is now just:

function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1); %// One cost per iteration
    for iter = 1 : iterations
        theta = theta - (alpha/m)*X'*(X*theta - y); %// Fully vectorized update
        costs(iter) = computeCost(X,y,theta);
    end
end
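As a quick sanity check (my own suggestion, assuming you keep the corrected loop version in a separate file such as gradientDescentLoop.m), both implementations should produce the same theta on random data:

    X = [ones(100, 1), rand(100, 1)];                         %// made-up data
    y = rand(100, 1);
    tLoop = gradientDescentLoop(X, y, zeros(2,1), 0.01, 500); %// corrected loop version
    tVec  = gradientDescent(X, y, zeros(2,1), 0.01, 500);     %// vectorized version
    disp(max(abs(tLoop - tVec)))                              %// ~0 up to floating point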
  • Nice catch! Thank you so much.. that had me up for a long time. – Sridhar Thiagarajan Aug 28 '15 at 15:52
  • @SridharThiagarajan - No problem :) Please see my edit on how to get this working even more vectorized. – rayryeng Aug 28 '15 at 15:54
  • Thanks.. Learnt a lot from that. Apart from this being easier to read.. Does this offer computational advantage? – Sridhar Thiagarajan Aug 29 '15 at 00:22
  • Yup. Vectorization usually gives faster run times. Try timing the code using linear algebra and your current code. You'll see time savings. MATLAB has its backbone on matrix operations... So take advantage! – rayryeng Aug 29 '15 at 01:02
  • Okay great! Will do that – Sridhar Thiagarajan Aug 29 '15 at 02:11
  • That's a great explanation sir. But I don't get the last part, I can't understand how you get `X^T` from the sum in the gradient descent equation. Thank you in advance. – Iván Rodríguez Torres Sep 28 '16 at 14:25
  • 1
    @IvanRodriguezTorres physically write out all terms in the transpose of `X` and the difference vector between the hypothesis and the true values. Multiply this matrix and vector together and you'll see that the expressions are equivalent. – rayryeng Sep 28 '16 at 14:26
  • How do you reach the conclusion that X^T(h-y) equals the sum in the gradient descent? Thank you. – Bhoomtawath Plinsut Nov 01 '16 at 03:05
  • @BhoomtawathPlinsut as I mentioned to Ivan, physically write out the terms in the transpose of `X` as well as the difference vector between the hypothesis and the true values. Multiply the transpose of the matrix and the difference vector together and you'll see that it's equivalent. – rayryeng Nov 01 '16 at 04:25
  • Thank you. I understand that they are equivalent, but how do you know that they are equivalent in the first place. Is it a written definition? – Bhoomtawath Plinsut Nov 01 '16 at 04:46
  • 1
    @BhoomtawathPlinsut ah. Well that came from intuition. For example, notice that the gradient terms are simply dot products. Each term in the gradient is a dot product with the difference vector between the hypothesis and the true values. Also, for each term we can see that we are taking the dot product between this difference vector and each feature for all examples as a vector. We can do this simultaneously for all features by transposing the training example matrix `X` and performing dot products with the rows of this matrix with this said difference vector. It isn't written. Just intuition. – rayryeng Nov 01 '16 at 05:36
  • @BhoomtawathPlinsut if you've found this answer useful, please consider giving an upvote. – rayryeng Nov 01 '16 at 05:36
  • 1
    @rayryeng Upvoted. Thank you for your clarification. – Bhoomtawath Plinsut Nov 01 '16 at 06:39