scikit-learn: cross_val_predict only works for partitions

Question

I am struggling to work out how to implement TimeSeriesSplit in sklearn.

The suggested answer at the link below yields the same ValueError.

sklearn TimeSeriesSplit cross_val_predict only works for partitions

Here is the relevant bit of my code:

from sklearn.model_selection import cross_val_predict, TimeSeriesSplit
from sklearn import svm

features = df[df.columns[0:6]]
target = df['target']

clf = svm.SVC(random_state=0)

pred = cross_val_predict(clf, features, target, cv=TimeSeriesSplit(n_splits=5).split(features))

ValueError                                Traceback (most recent call last)
<ipython-input-57-d1393cd05640> in <module>()
----> 1 pred = cross_val_predict(clf, features, target, cv=TimeSeriesSplit(n_splits=5).split(features))

/home/jedwards/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, pre_dispatch, method)
    407 
    408     if not _check_is_permutation(test_indices, _num_samples(X)):
--> 409         raise ValueError('cross_val_predict only works for partitions')
    410 
    411     inv_test_indices = np.empty(len(test_indices), dtype=int)

ValueError: cross_val_predict only works for partitions

How to use TimeSeriesSplit with cross_val_predict in a stacking context: https://datascience.stackexchange.com/a/105116/76808 — Marco Cerliani, Dec 14 '21 at 18:04

score 12 · Accepted Answer · answered Apr 07 '17 at 13:38

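cross_val_predict cannot work with TimeSeriesSplit, because the first chunk of the data is never part of any test set, so no predictions are ever made for it. cross_val_predict requires the test sets to form a partition of the dataset: every sample must appear in exactly one test set.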

For example, with the dataset [1, 2, 3, 4, 5] and four splits:

fold 1 - train: [1], test: [2]
fold 2 - train: [1, 2], test: [3]
fold 3 - train: [1, 2, 3], test: [4]
fold 4 - train: [1, 2, 3, 4], test: [5]

In none of the folds is 1 in the test set.
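The folds listed above can be reproduced with a short check. This is a minimal sketch, assuming a toy array named data and n_splits=4 (both chosen only for illustration); it prints each fold and collects the indices that never show up in a test set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy dataset from the example above (an assumption for illustration).
data = np.array([1, 2, 3, 4, 5])
tscv = TimeSeriesSplit(n_splits=4)

tested = []  # indices that appear in some test set
for fold, (train_idx, test_idx) in enumerate(tscv.split(data), start=1):
    print(f"fold {fold} - train: {data[train_idx].tolist()}, "
          f"test: {data[test_idx].tolist()}")
    tested.extend(test_idx.tolist())

# Indices that are never predicted; a non-empty result means the test sets
# do not cover the whole dataset, which is what cross_val_predict checks for.
never_tested = sorted(set(range(len(data))) - set(tested))
print("never tested (indices):", never_tested)
```

Index 0 (the value 1) is the only entry that never gets tested, so the test sets are not a partition of the data.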

If you want predictions for 2-5, you can loop through the splits generated by your CV object yourself, fit on each training set, and store the predictions for each test set.
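One way to write that manual loop is sketched below. It is not the only way to do it, and it uses synthetic stand-in data (X, y) rather than the asker's DataFrame; the names X, y, pred, and has_pred are assumptions made for this example. Samples in the first chunk are never tested, so their slots stay NaN.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 60 samples, 6 features, binary target.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 6))
y = (X[:, 0] > 0).astype(int)

clf = SVC(random_state=0)
tscv = TimeSeriesSplit(n_splits=5)

# One slot per sample; NaN marks samples that never land in a test set.
pred = np.full(len(y), np.nan)

for train_idx, test_idx in tscv.split(X):
    model = clone(clf)  # fresh, unfitted copy for every fold
    model.fit(X[train_idx], y[train_idx])
    pred[test_idx] = model.predict(X[test_idx])

has_pred = ~np.isnan(pred)
print("samples with a prediction:", int(has_pred.sum()), "of", len(y))
```

With a pandas DataFrame such as features and target from the question, the same loop should work by indexing with features.iloc[train_idx] and target.iloc[train_idx]. Any scoring should then be restricted to the rows where has_pred is True.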

Matthijs Brouns


Thank you. The additional loops seem a little pointless if we wish to train and test on the maximum amount of data. It would probably be easier in this case to just split the data at some arbitrary index (e.g. 75% training data). Perhaps I am missing the point of the TimeSeriesSplit function. – James Edwards Apr 07 '17 at 15:37
If you want to train on the maximum amount of data, TimeSeriesSplit gives you that: in the final fold you could theoretically train your model on all but one observation. The main reason for using TimeSeriesSplit is that you can't use e.g. leave-k-out cross-validation, as that would leak information through correlations with the other observations. – Matthijs Brouns Apr 10 '17 at 07:27
