I am trying to fit a linear regression model to the Ames Housing dataset available on Kaggle.
I first cleaned the data manually by removing many features. Then I used the following implementation to train the model.
import numpy as np
import tensorflow as tf

train_size = np.shape(x_train)[0]
valid_size = np.shape(x_valid)[0]
test_size = np.shape(x_test)[0]
num_features = np.shape(x_train)[1]

graph = tf.Graph()
with graph.as_default():
    # Input
    tf_train_dataset = tf.constant(x_train)
    tf_train_labels = tf.constant(y_train)
    tf_valid_dataset = tf.constant(x_valid)
    tf_test_dataset = tf.constant(x_test)

    # Variables
    weights = tf.Variable(tf.truncated_normal([num_features, 1]))
    biases = tf.Variable(tf.zeros([1]))

    # Loss Computation
    train_prediction = tf.matmul(tf_train_dataset, weights) + biases
    loss = tf.losses.mean_squared_error(tf_train_labels, train_prediction)

    # Optimizer
    # Gradient descent optimizer with learning rate = alpha
    alpha = tf.constant(0.000000003, dtype=tf.float64)
    optimizer = tf.train.GradientDescentOptimizer(alpha).minimize(loss)

    # Predictions
    valid_prediction = tf.matmul(tf_valid_dataset, weights) + biases
    test_prediction = tf.matmul(tf_test_dataset, weights) + biases
This is how my graph runs:
num_steps = 10001

def accuracy(prediction, labels):
    return ((prediction - labels) ** 2).mean(axis=None)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        if (step % 1000 == 0):
            print('Loss at step %d: %f' % (step, l))
            print('Validation accuracy: %.1f%%' % accuracy(valid_prediction.eval(), y_valid))
    t_pred = test_prediction.eval()
    print('Test accuracy: %.1f%%' % accuracy(t_pred, y_test))
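A side note on the numbers printed above: the `accuracy` helper returns a mean squared error, so its value is in squared target units rather than a percentage. Assuming the labels are sale prices (the post does not show them), taking the square root gives an error in the same units as the price. The following is only a minimal sketch; the helper name `rmse` and the sample values are hypothetical and are not taken from the dataset.

```python
import math

def rmse(predictions, labels):
    """Root mean squared error of two equal-length sequences of numbers."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices, used only to show the calculation.
example_predictions = [200000.0, 150000.0]
example_labels = [210000.0, 120000.0]

example_rmse = rmse(example_predictions, example_labels)
print('RMSE: %.2f' % example_rmse)
```

For scale, an MSE of 1,000,000,000 corresponds to an RMSE of roughly 31,623 in the target's units.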
Here is what I've tried:
I have tried increasing the learning rate. However, if I raise it beyond the current value, the model fails to converge, i.e., the loss explodes to infinity.
I have tried increasing the number of iterations to 10,000,000. The loss decreases more slowly the longer I iterate (which is understandable), but it is still very far from a reasonable value. The loss is usually a 10-digit number.
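One possible explanation for the symptoms listed above (only a tiny learning rate is stable, and the loss then stalls at a large value) is that the features are on very different scales. The sketch below illustrates that effect with plain NumPy gradient descent on synthetic data. It is not the Ames data and not the TensorFlow graph from this post: the function `gradient_descent_mse`, the features `area` and `quality`, and both learning rates are assumptions chosen for this illustration.

```python
import numpy as np

def gradient_descent_mse(x, y, learning_rate, num_steps):
    """Batch gradient descent on mean squared error; returns the final MSE."""
    n, d = x.shape
    w = np.zeros((d, 1))
    b = 0.0
    for _ in range(num_steps):
        err = x @ w + b - y                       # residuals, shape (n, 1)
        grad_w = (2.0 / n) * (x.T @ err)          # dMSE/dw
        grad_b = 2.0 * err.mean()                 # dMSE/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return float(((x @ w + b - y) ** 2).mean())

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: two features on very different scales.
area = rng.uniform(500, 4000, size=(n, 1))        # hypothetical living area
quality = rng.uniform(1, 10, size=(n, 1))         # hypothetical quality score
x_raw = np.hstack([area, quality])
y = 100.0 * area + 20000.0 * quality + rng.normal(0, 10000, size=(n, 1))

# Standardize each feature to zero mean and unit variance.
x_scaled = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)

# Raw features only tolerate a tiny step size; scaled features allow a much larger one.
mse_raw = gradient_descent_mse(x_raw, y, learning_rate=1e-8, num_steps=2000)
mse_scaled = gradient_descent_mse(x_scaled, y, learning_rate=0.1, num_steps=2000)

print('MSE with raw features:    %.0f' % mse_raw)
print('MSE with scaled features: %.0f' % mse_scaled)
```

In this toy setup the large-valued feature dictates the largest stable step size, so the weight for the small-valued feature and the bias barely move. Whether the same applies to the cleaned Ames features would need to be checked against the actual data.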
Am I doing something wrong with the graph? Or is linear regression a bad choice for this problem, and should I try another algorithm? Any help or suggestions are much appreciated!