The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms because it's simple
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=np.int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x
is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
h = np.dot(x, theta)
gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01 6.38435468e-02 1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
# [ 6.38433756e-02]
# [ 1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.