I've taught myself machine learning from some online resources, but I have a question about gradient descent that I couldn't figure out.
The gradient descent update rule for logistic regression is:
Repeat {
θj := θj − (α/m)∑(hθ(x(i))−y(i))xj(i)    (sum over i = 1 to m)
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x(i)) is the hypothesis for the i-th training example; y(i) is the actual value for the i-th example; and xj(i) is the value of variable j for the i-th example. m is the number of training examples, and the summation runs over the examples (i.e. that's what the summation sign is for).
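To make the notation concrete, here is how I would write one update step in numpy (the toy data, dimensions, and function name are my own, just for illustration):

```python
import numpy as np

def sigmoid(z):
    # hθ(x) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    # One update: θj := θj - (α/m) * Σ (hθ(x(i)) - y(i)) * xj(i)
    m = X.shape[0]                 # number of training examples
    h = sigmoid(X @ theta)         # hypothesis for every example at once
    grad = X.T @ (h - y) / m       # one partial derivative per θj
    return theta - alpha * grad

# toy data (made up): 4 examples, intercept column of 1s plus 2 features
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.5, 0.3],
              [1.0, 2.0, 2.2],
              [1.0, 0.1, 0.4]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(3)                # coefficients initialized to 0
theta = gradient_step(theta, X, y, alpha=0.1)
```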
This is where I get confused.
It's not clear to me whether the summation runs over my entire training set or only over the examples I've processed up to that point.
For example, imagine I have 10 training examples. If I perform gradient descent after each training example, my coefficients will be very different from what they would be if I performed gradient descent only after all 10 training examples.
See below how the First Way differs from the Second Way:
First way
- Step 1: Since the coefficients are initialized to 0, hθ(x) = 1/(1+e^0) = 0.5
- Step 2: Perform gradient descent on the first training example; the summation term includes only that 1 example
- Step 3: Use the new coefficients to compute the hypothesis for training examples 1 & 2; the summation term includes the first 2 training examples
- Step 4: Perform gradient descent again
- Step 5: Use the new coefficients to compute the hypothesis for training examples 1, 2 & 3; the summation term includes the first 3 training examples
- Continue until convergence or until all training examples have been used
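In code, the First Way as I've described it would look something like this sketch (the function name and toy data are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def first_way(X, y, alpha):
    # Step k performs one gradient descent update whose summation term
    # includes only the first k training examples, using the coefficients
    # produced by the previous step.
    theta = np.zeros(X.shape[1])
    for k in range(1, X.shape[0] + 1):
        Xk, yk = X[:k], y[:k]              # growing prefix of the training set
        h = sigmoid(Xk @ theta)
        theta = theta - alpha * (Xk.T @ (h - yk)) / k
    return theta

# toy data (made up): 3 examples, intercept column plus 1 feature
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = first_way(X, y, alpha=0.1)
```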
Second way
- Step 1: Since the coefficients are initialized to 0, hθ(x) = 1/(1+e^0) = 0.5 for all 10 training examples
- Step 2: Perform 1 step of gradient descent using all 10 training examples. The coefficients will differ from the First Way's because the summation term includes all 10 training examples
- Step 3: Use the new coefficients to recompute the hypothesis for all 10 training examples; the summation term again includes all 10 training examples
- Step 4: Perform gradient descent again, and keep using all 10 examples on each step until convergence
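And the Second Way, as I understand it, would be plain batch gradient descent, something like this sketch (again, names and toy data are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def second_way(X, y, alpha, n_iters):
    # Every iteration is one batch update: the summation term always
    # includes ALL m training examples.
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        theta = theta - alpha * (X.T @ (h - y)) / m
    return theta

# toy data (made up): 3 examples, intercept column plus 1 feature
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = second_way(X, y, alpha=0.1, n_iters=100)
```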
I hope that explains my confusion. Does anyone know which way is correct?
Edit: Adding cost function and hypothesis function
cost function: J(θ) = −(1/m)∑[y(i) log(hθ(x(i))) + (1−y(i)) log(1−hθ(x(i)))]
hθ(x) = 1/(1 + e^(−z))
and z = θ0 + θ1X1 + θ2X2 + θ3X3 + ... + θnXn
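In code, the cost and hypothesis would be something like this (the toy data is made up; note that with θ = 0 the hypothesis is 0.5 everywhere, so the cost is log(2) ≈ 0.693 regardless of the labels):

```python
import numpy as np

def sigmoid(z):
    # hθ(x) = 1 / (1 + e^-z), with z = θ0 + θ1x1 + ... + θnxn = θ·x
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(θ) = -(1/m) Σ [ y log(hθ(x)) + (1-y) log(1-hθ(x)) ]
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# toy data (made up): 2 examples, intercept column plus 1 feature
X = np.array([[1.0, 0.5],
              [1.0, 1.5]])
y = np.array([0.0, 1.0])
c0 = cost(np.zeros(2), X, y)   # cost at the all-zeros starting point
```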