Problem predicting test values with python sklearn

Question

I wrote some code to predict Y values; X and Y are arrays of the same length.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

plt.scatter(X, Y, 1)
regr2 = make_pipeline(PolynomialFeatures(10), Ridge())
regr2 = regr2.fit(X[:, np.newaxis], Y)
y_pred = regr2.predict(X[:, np.newaxis])
plt.plot(X, y_pred, color='red')
plt.show()

It works and gives a good approximation. But when I split the data into train and test values, the plot shows an exponential-looking curve, which it is not supposed to do.

In fact, y_pred1 is just X_test plus a small decimal number.

# First 80% of the samples for training, last 20% for testing
X_train = X[0:int(0.8*len(X))]
X_test = X[int(0.8*len(X)):]
Y_train = Y[0:int(0.8*len(X))]
Y_test = Y[int(0.8*len(X)):]

# Scatter the test points only after X_test and Y_test are defined
plt.scatter(X_test, Y_test, 1)

regr3 = make_pipeline(PolynomialFeatures(10), Ridge())
regr3 = regr3.fit(X_train[:, np.newaxis], Y_train)
y_pred1 = regr3.predict(X_test[:, np.newaxis])
plt.plot(X_test, y_pred1, color='red')
plt.show()

I tried several things, including predicting on the training values, and in that case too the plot shows an exponential instead of an approximation of the points.

Thanks in advance!

`plt.plot()` is a line plotting function. Did you want `plt.scatter()`? — G. Anderson, Nov 02 '18 at 17:06
Is there a reason you're trying to do the train/test split manually instead of using sklearn's `train_test_split()`? Your method (in addition to being written wrong as pointed out by @Qudus) will not do any random selection, which will be a problem if your array isn't already randomized — G. Anderson, Nov 02 '18 at 17:19
Using plot was an error but replacing it did not solve the problem. I did this because I want to predict something in time so I need to take the first 80% as training values and the last 20% as test values. Why will it be a problem? — Antoine -, Nov 03 '18 at 10:53
I understand now, I wasn't aware that it's a time-series problem. If it were not time series, that would not be a good way to split your train-test sets, which is why I mentioned it. Given that, time series can be tricky, as it tends to do as you stated and keep extrapolating in a certain direction. You may be able to do cross validation with [time series split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) to get a better estimator, or try different models — G. Anderson, Nov 05 '18 at 15:47
[Here](https://stackoverflow.com/questions/20841167/how-to-predict-time-series-in-scikit-learn) is a discussion on doing time-series in pandas and scikit — G. Anderson, Nov 05 '18 at 15:49
Honestly, a large portion of your problem would be solved if you just used sklearn's built-in `train_test_split()` method — Aayush Panda, Oct 08 '20 at 00:49
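
For reference, the time series split suggested in the comments could look roughly like the sketch below. The data here is synthetic (an assumption, since the question does not include the real X and Y), and the scoring metric is only an illustrative choice. Each fold trains on earlier samples and validates on later ones, so nothing is shuffled.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the asker's data (assumption: X is an ordered time axis).
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 200)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(X.shape)

# Same pipeline as in the question.
model = make_pipeline(PolynomialFeatures(10), Ridge())

# Each split uses a growing prefix for training and the next chunk for testing.
tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

# Negative mean squared error per fold (closer to 0 is better).
scores = cross_val_score(
    model, X[:, np.newaxis], Y, cv=tscv, scoring="neg_mean_squared_error"
)
print(scores)
```

Folds whose score is much worse than the others would point to the model extrapolating poorly beyond the range it was trained on.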

Qudus · Answer 1 · 2018-11-02T17:25:05.113

0

Fix Y_train

Y_train=Y[0:int(0.8*len(X))]
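
One way to keep the X and Y slices from drifting apart is to compute the split index once and reuse it. A minimal sketch with made-up arrays (the real data is not shown in the question); it also checks that sklearn's `train_test_split` with `shuffle=False` gives the same chronological split here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up ordered data standing in for the asker's X and Y (assumption).
X = np.arange(10, dtype=float)
Y = 2.0 * X + 1.0

# Manual chronological split: one index shared by X and Y.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

# shuffle=False keeps the original order, so the last 20% becomes the test set.
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, shuffle=False)

print(np.array_equal(X_train, X_tr), np.array_equal(Y_test, Y_te))
```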

edited Nov 02 '18 at 17:25

answered Nov 02 '18 at 17:10


I changed it but still have a wrong approximation. Now the curve looks like an exponential. – Antoine - Nov 03 '18 at 10:57
