I wrote some code to predict Y values. X and Y are arrays of the same length.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Scatter the raw data (marker size 1), then fit a degree-10 polynomial ridge model
plt.scatter(X, Y, 1)
regr2 = make_pipeline(PolynomialFeatures(10), Ridge())
regr2 = regr2.fit(X[:, np.newaxis], Y)
y_pred = regr2.predict(X[:, np.newaxis])
plt.plot(X, y_pred, color='red')
plt.show()

It works and gives a good approximation. But when I split the data into train and test values, the plot shows an exponential-looking curve, which is not what I expect.

In fact, y_pred1 is just X_test plus a small decimal number.

# Chronological split: first 80% for training, last 20% for testing
X_train=X[0:int(0.8*len(X))]
X_test=X[int(0.8*len(X)):]
Y_train=Y[0:int(0.8*len(X))]
Y_test=Y[int(0.8*len(X)):]

# Scatter the test points only after X_test and Y_test have been defined
plt.scatter(X_test, Y_test, 1)

regr3 = make_pipeline(PolynomialFeatures(10), Ridge())
regr3 = regr3.fit(X_train[:, np.newaxis], Y_train)
y_pred1 = regr3.predict(X_test[:, np.newaxis])
plt.plot(X_test, y_pred1, color='red')
plt.show()

I tried several things, even predicting on the training values, and in that case too the plot is an exponential instead of an approximation of the points.

Thanks in advance!

Antoine -
  • `plt.plot()` is a line plotting function. Did you want `plt.scatter()`? – G. Anderson Nov 02 '18 at 17:06
  • Is there a reason you're trying to do the train/test split manually instead of using sklearn's `train_test_split()`? Your method (in addition to being written wrong as pointed out by @Qudus) will not do any random selection, which will be a problem if your array isn't already randomized – G. Anderson Nov 02 '18 at 17:19
  • Using plot was an error but replacing it did not solve the problem. I did this because I want to predict something in time so I need to take the first 80% as training values and the last 20% as test values. Why will it be a problem? – Antoine - Nov 03 '18 at 10:53
  • I understand now, I wasn't aware that it's a time-series problem. If it were not time series, that would not be a good way to split your train-test sets, which is why I mentioned it. Given that, time series can be tricky, as it tends to do as you stated and keep extrapolating in a certain direction. You may be able to do cross validation with [time series split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) to get a better estimator, or try different models – G. Anderson Nov 05 '18 at 15:47
  • [Here](https://stackoverflow.com/questions/20841167/how-to-predict-time-series-in-scikit-learn) is a discussion on doing time-series in pandas and scikit – G. Anderson Nov 05 '18 at 15:49
  • Honestly, a large portion of your problem would be solved if you just used sklearn's built-in `train_test_split()` method – Aayush Panda Oct 08 '20 at 00:49
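
To illustrate the time series split suggested in the comments above, here is a minimal sketch. It assumes synthetic, time-ordered data in place of the real X and Y, and the choice of degrees to compare (1, 3, 10) and the name `fold_mse` are purely illustrative. Each fold trains on an initial stretch of the series and scores on the block that follows, which gives a rough idea of how a given polynomial degree behaves when it has to extrapolate forward.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, time-ordered data (an assumption; substitute the real X and Y).
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 200)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(X.size)

tscv = TimeSeriesSplit(n_splits=4)
fold_mse = {}
for degree in (1, 3, 10):  # illustrative degrees, not a recommendation
    errors = []
    for train_idx, test_idx in tscv.split(X):
        # Each fold trains on the past and tests on the block that follows it.
        model = make_pipeline(PolynomialFeatures(degree), Ridge())
        model.fit(X[train_idx, np.newaxis], Y[train_idx])
        pred = model.predict(X[test_idx, np.newaxis])
        errors.append(np.mean((pred - Y[test_idx]) ** 2))
    # Average forward-looking mean squared error for this degree.
    fold_mse[degree] = float(np.mean(errors))

print(fold_mse)
```

Comparing the averaged errors across degrees is one way to choose a model that extrapolates less wildly. It does not guarantee a good forecast beyond the observed range.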

1 Answer

Fix Y_train

Y_train=Y[0:int(0.8*len(X))]
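
As a hedged sketch of one way to keep the slices consistent, the cut index can be computed once and reused for all four arrays, so X and Y can never be split at different positions. The data below is a placeholder, and the names `split`, `X_tr`, `X_te`, `Y_tr`, `Y_te` are illustrative only. `train_test_split` with `shuffle=False` is shown as an alternative that keeps the original order, so the last 20% becomes the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder time-ordered data (an assumption; use the real X and Y).
X = np.arange(10, dtype=float)
Y = 2.0 * X + 1.0

# Compute the cut once so that X and Y are always sliced at the same index.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

# Alternative: shuffle=False preserves order, so no random selection happens.
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, shuffle=False)
```

For this placeholder data both approaches give the same chronological split. For other array lengths the two may round the cut point differently, so it is worth checking the resulting sizes.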
Qudus