
As part of my homework I was asked to implement stochastic gradient descent to solve a linear regression problem (even though I have only 200 training examples). My problem is that stochastic gradient descent converges too smoothly, almost exactly like batch gradient descent, which brings me to my question: why does it look so smooth, considering that it's usually much noisier? Is it because I use it with only 200 examples?

Convergence plots:

Stochastic gradient descent

Gradient descent

MSE with weights from stochastic gradient descent: 2.78441258841

MSE with weights from gradient descent: 2.78412631451 (identical to MSE with weights from normal equation)
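
For reference, the normal-equation weights can be computed like this (a sketch using numpy and the helper functions from the code below; it assumes `X` already includes a column of ones for the intercept and `y` is the target vector):

# normal-equation solution for comparison: w = (X^T X)^{-1} X^T y
# (assumes X already has a bias column of ones and y is the target vector)
w_norm = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(mserror(y, linear_prediction(X, w_norm)))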

My code:

import numpy as np


def mserror(y, y_pred):
    # mean squared error between targets and predictions
    n = y.size
    diff = y - y_pred
    return float(np.sum(diff ** 2)) / n


def linear_prediction(X, w):
    # linear model predictions: X w
    return np.dot(X, w)


def gradient_descent_step(X, y, w, eta):
    # one full-batch step: the gradient of the MSE is (2/n) * X^T (Xw - y)
    n = X.shape[0]
    grad = (2.0 / n) * np.sum(np.transpose(X) * (linear_prediction(X, w) - y), axis=1)
    return w - eta * grad


def stochastic_gradient_step(X, y, w, train_ind, eta):
    # one step using a single example; the 2.0/n factor is kept from the
    # batch version, so the effective step size is eta/n
    n = X.shape[0]
    grad = (2.0 / n) * X[train_ind] * (linear_prediction(X[train_ind], w) - y[train_ind])
    return w - eta * grad


def gradient_descent(X, y, w_init, eta, max_iter):
    # full-batch gradient descent, recording the MSE after every step
    w = w_init
    errors = [mserror(y, linear_prediction(X, w))]

    for i in range(max_iter):
        w = gradient_descent_step(X, y, w, eta)
        errors.append(mserror(y, linear_prediction(X, w)))

    return w, errors


def stochastic_gradient_descent(X, y, w_init, eta, max_iter):
    # SGD: at each iteration pick one example at random and step on it,
    # recording the MSE on the full dataset after every step
    n = X.shape[0]
    w = w_init
    errors = [mserror(y, linear_prediction(X, w))]

    for i in range(max_iter):
        random_ind = np.random.randint(n)
        w = stochastic_gradient_step(X, y, w, random_ind, eta)
        errors.append(mserror(y, linear_prediction(X, w)))

    return w, errors
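
For context, a sketch of how these functions might be run end to end (the column names, the standardization step, and the eta / iteration counts here are illustrative assumptions, not the exact values I used):

# illustrative driver -- column names, scaling and hyperparameters are assumptions
import pandas as pd

data = pd.read_csv('advertising.csv')
X = data[['TV', 'Radio', 'Newspaper']].values
y = data['Sales'].values

# standardize features and add a bias column of ones
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.hstack([np.ones((X.shape[0], 1)), X])

w_init = np.zeros(X.shape[1])
w_gd, errors_gd = gradient_descent(X, y, w_init, eta=1e-2, max_iter=1000)
w_sgd, errors_sgd = stochastic_gradient_descent(X, y, w_init, eta=1e-2, max_iter=10**5)

print(mserror(y, linear_prediction(X, w_gd)))
print(mserror(y, linear_prediction(X, w_sgd)))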
  • Link to the dataset in case it's relevant [link](https://d3c33hcgiwev3.cloudfront.net/_739f9073ae55f970a4924e22bcc93124_advertising.csv?Expires=1489536000&Signature=FkGFWREjxOvTnTzYIAxrJNbKE56DE~C2frqtFAQGR~7azq3I2ztYdZaFRo7zG1rWl1jtMOZDK42~NC2Az2031dokutWGDeIHp4Q6pD2yWBcL2jPijassInyTwl3974vDVJ3ewjeedB652bmoGkMcpt3YVemp5Y71SyKQOrvaB6M_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A) – Ivan Panshin Mar 13 '17 at 15:15

1 Answer


There is nothing unusual about your graph. You should also note that your batch method takes fewer iterations to converge.

You may be letting SGD plots from neural networks cloud your view of what SGD "should" look like. Most neural networks are much more complicated models (and harder to optimize), working on harder problems. That contributes to the "jaggedness" you might be expecting.

Linear regression is a simple problem with a convex objective. That means any step that lowers the error is guaranteed to be a step toward the best possible solution. That's a lot less complicated than a neural network, and part of why you see such a smooth error reduction. It's also why you see almost identical MSEs: both SGD and batch will converge to the exact same solution.
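
Concretely, the objective you're minimizing in both cases is the mean squared error, a convex quadratic in the weights, and its gradient is exactly what your code computes:

$$\mathrm{MSE}(w) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^\top w - y_i\right)^2, \qquad \nabla_w \mathrm{MSE}(w) = \frac{2}{n}X^\top\left(Xw - y\right).$$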

If you want to try to force some non-smoothness, you can keep increasing the learning rate eta, but that's kind of a silly exercise. Eventually you'll just reach a point where you don't converge at all, because every step overshoots the solution.
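
If you do want to see that, something along these lines will do it (reusing X, y, w_init and the functions from the question; the particular eta values are just illustrative):

# illustrative only: larger eta -> noisier error curve, and eventually divergence
# (uses X, y, w_init and the functions defined in the question)
for eta in (0.01, 0.1, 0.5, 1.0):
    w, errors = stochastic_gradient_descent(X, y, w_init, eta, 10000)
    print(eta, errors[-1])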

Raff.Edward
  • Yeah, I noticed that my batch gradient descent converges in fewer iterations, although each iteration is more expensive (I guess roughly n times more expensive, minus whatever vectorization saves). Interesting. Not only does linear regression with MSE have an analytical solution, it's also simple enough for stochastic gradient descent to converge this easily. – Ivan Panshin Mar 13 '17 at 15:40
  • That all is the general theme. SGD is often "good enough" for ML needs, which don't require many significant figures in the solution. For other domains batch methods can be more important, and there exist smarter batch methods like LBFGS that make better use of the extra information. – Raff.Edward Mar 13 '17 at 15:42
  • I've used LBFGS a couple of times, even though I haven't studied exactly how it works. Could you tell me what you mean by "extra information"? – Ivan Panshin Mar 13 '17 at 15:46
  • The comment section isn't an appropriate place to go into that kind of detail. The short of it is, there is more to optimization than just using the gradient - and using the whole dataset to compute a gradient gives you more information than just a single data point. You can look at this free book if you want to learn more: https://stanford.edu/~boyd/cvxbook/ – Raff.Edward Mar 13 '17 at 17:05
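
For anyone curious about the LBFGS mention above, here is a minimal sketch of fitting the same least-squares objective with SciPy's L-BFGS-B; it reuses X, y and the helper functions from the question and is only an illustration, not the method discussed in the answer:

# sketch: the same MSE objective minimized with L-BFGS-B via SciPy
# (reuses X, y, mserror and linear_prediction from the question)
from scipy.optimize import minimize

def mse_objective(w):
    return mserror(y, linear_prediction(X, w))

def mse_gradient(w):
    return (2.0 / X.shape[0]) * np.dot(X.T, linear_prediction(X, w) - y)

res = minimize(mse_objective, np.zeros(X.shape[1]), jac=mse_gradient, method='L-BFGS-B')
print(res.x, res.fun)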