The task is to predict the next day's stock price (y). The input (X) consists of one variable, 'today's price', and three other variables describing the stock's fundamentals. I have 1000 days of training data and 300 days of test data. All four variables in X are scaled to the range [-1, 1]. I then run three linear regression experiments (a minimal sketch of all three follows the list):
Experiment 1: solve the linear regression via its closed-form (least-squares) solution on the training data.
Experiment 2: train the linear regression by gradient descent with optim.SGD on the training data.
Experiment 3: train the linear regression by gradient descent with optim.Adam on the training data.
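For reference, here is a minimal, self-contained sketch of my setup. The data is synthetic and the "true" weights are placeholders I made up for illustration, since I cannot share the real dataset; likewise, the learning rates and epoch count below are illustrative values, not necessarily the ones from my actual runs. Only the overall structure (1000 training days, 300 test days, four features in [-1, 1]) matches what I described above:

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the real data: 1000 training days, 300 test days,
# four features scaled to [-1, 1]. The "true" weights below are made up
# purely for illustration (feature 0 plays the role of today's price).
n_train, n_test, n_features = 1000, 300, 4
X_train = torch.rand(n_train, n_features) * 2 - 1
X_test = torch.rand(n_test, n_features) * 2 - 1
true_w = torch.tensor([[0.90], [0.05], [0.03], [0.02]])
y_train = X_train @ true_w + 0.01 * torch.randn(n_train, 1)
y_test = X_test @ true_w + 0.01 * torch.randn(n_test, 1)

# Experiment 1: closed-form least-squares solution, with a bias column.
X_aug = torch.cat([X_train, torch.ones(n_train, 1)], dim=1)
theta = torch.linalg.lstsq(X_aug, y_train).solution  # shape (5, 1)
X_test_aug = torch.cat([X_test, torch.ones(n_test, 1)], dim=1)
closed_mse = torch.mean((X_test_aug @ theta - y_test) ** 2).item()

# Experiments 2 and 3: identical full-batch training loop; only the
# optimizer class differs.
def train(optimizer_cls, lr, epochs=2000):
    model = torch.nn.Linear(n_features, 1)
    opt = optimizer_cls(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        opt.step()
    return model

sgd_model = train(torch.optim.SGD, lr=0.1)
adam_model = train(torch.optim.Adam, lr=1e-3)  # Adam's default lr

print(f"closed-form test MSE: {closed_mse:.6f}")
with torch.no_grad():
    for name, model in [("SGD", sgd_model), ("Adam", adam_model)]:
        test_mse = torch.mean((model(X_test) - y_test) ** 2).item()
        print(f"{name} test MSE: {test_mse:.6f}, "
              f"weights: {model.weight.squeeze().tolist()}")
```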
My observations are:
Experiments 1 and 2 achieve almost the same accuracy on the test data. Also, the weight of 'today's price' is much larger than the weights of the other variables, which matches intuition.
Experiment 3 gets much worse accuracy on both the training and the test data; it seems to learn nothing from the data.
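For completeness, here is a sketch of the loss logging I could use to check whether Adam is diverging or just converging very slowly. It reuses X_train, y_train, and n_features from the snippet above; train_logged is just the same loop with printing added:

```python
# Same loop as train() above, but logging the training loss periodically.
def train_logged(optimizer_cls, lr, epochs=2000, log_every=500):
    model = torch.nn.Linear(n_features, 1)
    opt = optimizer_cls(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        opt.step()
        if epoch % log_every == 0:
            print(f"{optimizer_cls.__name__} epoch {epoch}: "
                  f"train loss {loss.item():.6f}")
    return model

train_logged(torch.optim.Adam, lr=1e-3)
```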
However, I expected Experiment 3 to perform similarly to, or even better than, Experiment 2. What explains this behavior?