During Stochastic Gradient Descent, what's the differences between these two updating hypothese ways?

Question

I have a question about updating the theta during the Stochastic GD. I have two ways to update theta:

1) Use the previous theta, to get all the hypotheses for all samples, and then update the theta by each sample. Like:

hypothese = np.dot(X, theta)
for i in range(0, m):
    theta = theta + alpha * (y[i] - hypothese[i]) * X[i]

2) Another way: during the scan the samples, update the hypothese[i] using the latest theta. Like:

for i in range(0, m):
    h = np.dot(X[i], theta)
    theta = theta + alpha * (y[i] - h) * X[i]

I checked the SGD code, it seems the second way is correct. But during my coding, the first one will converge faster and the result is better than the second. Why the wrong way can perform better than the correct way?

I also attached the completed code as following:

def SGD_method1():
maxIter = 100 # max iterations
alpha = 1e4 # learning rate
m, n = np.shape(X)  # X[m,n], m:#samples, n:#features
theta = np.zeros(n) # initial theta
for iter in range(0, maxIter):
    hypothese = np.dot(X, theta)  # update all the hypoes using the same theta
    for i in range(0, m):
        theta = theta + alpha * (y[i] - hypothese[i]) * X[i]
return theta

def SGD_method2():
maxIter = 100 # max iterations
alpha = 1e4 # learning rate
m, n = np.shape(X)  # X[m,n], m:#samples, n:#features
theta = np.zeros(n) # initial theta
for iter in range(0, maxIter):
    for i in range(0, m):
        h = np.dot(X[i], theta)  #  update on hypo using the latest theta
        theta = theta + alpha * (y[i] -h) * X[i]
return theta

The first is not stochastic, because you always look at all examples, not just one. — Thomas Jungblut, May 29 '14 at 19:39
For a good overview of improved SGDs by experts in the field, see [yaroslavvb.blogspot.../stochastic-gradient-methods-2014](http://yaroslavvb.blogspot.de/2014/03/stochastic-gradient-methods-2014.html) . — denis, Jun 15 '14 at 16:48

score 0 · Accepted Answer · answered May 29 '14 at 19:58

The first code is not SGD. It is a "Traditional" (batch) Gradient Descent. The stochasticity comes from updates based on gradient computed for one samples (or small batch, which is called mini-bach SGD). It is obviously not a correct gradient of the error function (which is the sum of errors for all the training samples) but one can prove, that under reasonable conditions such process converges to the local optima. The stochastic updates are preferable in many applications due to their simplicity and (in many cases) cheapier computation . Both algoirhtms are correct (both, under reasonable assumptions, guarantee convergance to the local optima), the choice of particular strategy depends on a particular problem (especially its size and other requirements).

During Stochastic Gradient Descent, what's the differences between these two updating hypothese ways?

1 Answers1