
I'm trying to write some very simple LMS batch gradient descent, but I believe I'm doing something wrong with the gradient. The ratio between each gradient component's order of magnitude and the corresponding initial value of theta differs wildly between the elements of theta, so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).

import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1), dtype=np.int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01

for i in range(1,1000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    print ((h - y)**2).sum(), theta.squeeze().tolist()

1 Answer


The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
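
To make that concrete, here is a quick look at the very first gradient the question's loop computes (this check is not part of the original answer, just an illustration on the same data):

import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[1, 2104, 3], [1, 1600, 3], [1, 2400, 3], [1, 1416, 2], [1, 3000, 4]])
theta = np.array([[0.], [.1], [50.]])

h = np.dot(x, theta)
gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
# Roughly [-69, -1.7e5, -257]: the house-size component is three to four
# orders of magnitude larger than the others, so no single alpha fits all.
print(gradient.squeeze())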

You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms, because it's simple:

import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=np.int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm

That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.

theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
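
As a quick sanity check (not in the original answer), the rescaled theta makes exactly the same predictions on the scaled features as the original guess did on the unscaled ones, because the per-column factors cancel in the dot product. Note this assumes x_orig holds floats; see the comment below about integer division under Python 2.

theta_orig = np.array([[0.], [.1], [50.]])
# x[:, j] = x_orig[:, j] / x_norm[j] and theta[j] = x_norm[j] * theta_orig[j],
# so the column-wise scale factors cancel.
print(np.allclose(np.dot(x, theta), np.dot(x_orig, theta_orig)))  # True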

With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.

alpha = .1
for i in range(1, 100000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
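
One way to hunt for that sweet spot (a sketch, not from the original answer) is to log the squared error every so often and watch whether it shrinks steadily or blows up:

alpha = .1
theta = (x_norm * [0., .1, 50.]).reshape(3, 1)
for i in range(1, 100000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    if i % 10000 == 0:
        # A steadily decreasing error means alpha is acceptable; if the
        # error grows or becomes nan, alpha is too large.
        print((i, ((h - y) ** 2).sum()))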

Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.

print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01   6.38435468e-02   1.03435881e+02]

At this point, let's cheat and check our results:

theta, error, _, _ = np.linalg.lstsq(x_orig, y)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
#                         [  6.38433756e-02]
#                         [  1.03436047e+02]]

A general introductory reference on feature scaling is this Stanford lecture.
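
For what it's worth, the scaling usually introduced in that kind of material is standardization, i.e. subtracting each feature's mean and dividing by its standard deviation. A minimal sketch along those lines, leaving the intercept column untouched (again not part of the original answer):

x_std = x_orig.astype(float)
mean = x_std[:, 1:].mean(axis=0)
std = x_std[:, 1:].std(axis=0)
x_std[:, 1:] = (x_std[:, 1:] - mean) / std
# Every non-intercept column now has mean 0 and standard deviation 1,
# so a single learning rate works for all components of the gradient.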

  • Thanks, didn't know about feature scaling! The code worked; the only thing was that I had to do `x = np.asfarray([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])` because the array was all `int` and in the scaling it was all being rounded down to 0. – wizplum May 30 '17 at 16:51
  • Ah, for the record, that behavior does not occur in Python 3, but I should have guessed from your final line that you were running Python 2. – fuglede May 30 '17 at 17:49