Linear regression accuracy 95%, but predicts past data

Question

I have a pandas dataframe with 4 features. I create labels for them from "forecast_col" by shifting that column back by forecast_out rows, so that I can make predictions later:

pandasdf['label'] = pandasdf[forecast_col].shift(-forecast_out)
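
To make the effect of the negative shift concrete, here is a minimal sketch on toy data. The column name "close" and the value forecast_out = 2 are illustrative assumptions, not taken from the original script.

```python
import pandas as pd

# Toy data: the "close" column and forecast_out = 2 are assumptions for illustration.
df = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast_out = 2

# A negative shift moves values up: row t receives the value from row t + forecast_out.
df["label"] = df["close"].shift(-forecast_out)

# The label column is now 12.0, 13.0, 14.0, NaN, NaN:
# the last forecast_out rows have no known future value yet.
print(df)
```

Each row's label is the value forecast_out rows later, so the last forecast_out rows are exactly the ones with no label.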

Taking all the rows except the 'label' column:

X = np.array(pandasdf.drop(['label'], axis=1))

Normalizing data:

X = preprocessing.scale(X)

Taking last rows for future prediction:

X_lately = X[-forecast_out:]

Selecting data for training and cross-validation:

X = X[:-forecast_out]
y = np.array(pandasdf['label'])[:-forecast_out] 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3)
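
For readers on a recent scikit-learn: the sklearn.cross_validation module was removed in version 0.20, and train_test_split now lives in sklearn.model_selection. The sketch below uses toy arrays (an assumption, not the original data) and also shows the shuffle=False option, which keeps the rows in time order instead of mixing past and future rows between the two sets. Whether a chronological split is appropriate here is a judgment call and not something the question itself states.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, time-ordered data: row i stands for time step i (illustrative only).
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2.0

# shuffle=False keeps chronological order: the first 70% of rows become
# the training set and the last 30% become the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

print(X_train.ravel(), X_test.ravel())
```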

Training classifier:

clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)

Checking accuracy (it's around 95%):

accuracy = clf.score(X_test, y_test)
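
A note on what this number means: for LinearRegression, score returns the coefficient of determination (R^2), not a classification accuracy. A small check on synthetic data (the data below is made up for illustration) shows that it matches sklearn.metrics.r2_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, for illustration only: a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)

# score() on a regressor is R^2, the same value r2_score computes.
score = model.score(X, y)
r2 = r2_score(y, model.predict(X))
print(score, r2)
```

So a value around 0.95 says the model explains about 95% of the variance of the labels on the test rows. It does not say that 95% of the predictions are "correct".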

Forecasting on the last data:

forecast_set = clf.predict(X_lately)

Here I should get a list of future prices for the next "forecast_out" periods, but instead the forecast I get looks like the prices of the same last data (X_lately).
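
One possible reading, offered as an assumption rather than a diagnosis of the original script: a linear model's prediction is a linear function of its input row, so the forecast for X_lately tends to trace the shape of the last forecast_out prices, even though each predicted value refers to a date forecast_out periods later. The sketch below uses a synthetic random walk and a single "close" feature (both assumptions, not the asker's data) and attaches future dates to the predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic random-walk "prices" on a daily index (illustrative assumption).
rng = np.random.default_rng(42)
dates = pd.date_range("2016-01-01", periods=500, freq="D")
close = 100 + np.cumsum(rng.normal(scale=1.0, size=500))
df = pd.DataFrame({"close": close}, index=dates)

forecast_out = 14
df["label"] = df["close"].shift(-forecast_out)

X = df[["close"]].to_numpy()
X_lately = X[-forecast_out:]    # last rows: no label exists for them yet
X_known = X[:-forecast_out]
y_known = df["label"].to_numpy()[:-forecast_out]

model = LinearRegression().fit(X_known, y_known)
forecast_set = model.predict(X_lately)

# Each prediction belongs to the date forecast_out days after its input row,
# so the forecast index starts the day after the last known date.
future_dates = df.index[-forecast_out:] + pd.Timedelta(days=forecast_out)
forecast = pd.Series(forecast_set, index=future_dates)

# With one feature the forecast is just a * close + b, so its shape
# follows the last known prices exactly.
corr = np.corrcoef(forecast_set, X_lately[:, 0])[0, 1]
print(forecast.head())
print("correlation with last known prices:", corr)
```

In this sketch the forecast is plotted against future dates, yet its shape copies the last known prices. If the original chart shows the same pattern, it may come from how the predictions are dated on the plot, or simply from the price forecast_out days ahead being close to the current price.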

Here's the example: forecasting the past

What am I doing wrong?

The charts look different. Does not look like you get a forecast for the same last data, no. — Maxim Egorushkin, Feb 20 '17 at 10:58
Trust me, I'm trying this on 125 different data sets, they are all the same. It's ~90% accuracy that makes them look "different". Take a look here for the various charts: https://yadi.sk/d/2P7TSsHC3BfnPa — Александр Нагорный, Feb 20 '17 at 11:07
I've tried to replicate your code and got a similar result. I have also compared label vs prediction plots for both the [training data](http://imgur.com/a/MciGf) and the [testing data](http://imgur.com/a/lntJm) and confirmed that they reflect the high accuracy score. So what else makes you think that something is wrong? — Yohanes Gultom, Feb 20 '17 at 14:12
As I said earlier, the problem is that the algorithm should return a prediction for the next 14 prices (i.e., from today to 14 days ahead), but it returns a prediction for the last 14 prices. Again, here's the example: https://i.stack.imgur.com/VXs0z.png - the 14 forecasted prices are, with 95% accuracy, equal to the last 14 prices. Here's my full script, JFYI: https://yadi.sk/d/PQehRqpI3EJgHz — Александр Нагорный, Feb 20 '17 at 15:39
To clear things up: X_lately contains prices for last 14 days (till today). Algo should predict future 14 days, but it predicts same last 14 days. And that's the problem. — Александр Нагорный, Feb 20 '17 at 15:48
BTW, to prove something's wrong with the algo, try to drop more than just 'label' column (even the close price itself: X = np.array(pandasdf.drop(['label','close'], 1))) - accuracy will not drop! — Александр Нагорный, Feb 20 '17 at 16:13
@АлександрНагорный Did you find the issue? I am facing the same problem. Thanks. — Amit, Dec 23 '17 at 16:25
Nope. I thought shifting data for more than one row will help, but it did not - still "predicting" existing data with 95..98% accuracy. — Александр Нагорный, Jan 14 '18 at 20:09
