In sklearn.model_selection.cross_validate, is there a way to output the samples / indices which were used as the test set by the CV splitter for each fold?

1 Answer
There's an option to specify the cross-validation generator, using the cv argument:
cv : int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross-validation,
- int, to specify the number of folds in a (Stratified)KFold,
- a CV splitter,
- an iterable yielding (train, test) splits as arrays of indices (see the sketch below).
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.
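The last option is directly relevant to the question: if you materialize the splits yourself and pass them as an iterable, the test indices are known before cross_validate even runs. A minimal sketch (the splitter settings here are just an example):

from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate, KFold

X, y = datasets.load_diabetes(return_X_y=True)
# Materialize the (train, test) index pairs once and reuse them
splits = list(KFold(n_splits=5, shuffle=True, random_state=99).split(X))
cv_results = cross_validate(linear_model.Lasso(), X, y, cv=splits)
# splits[i][1] is exactly the test index array used in fold i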
Alternatively, if you provide a CV splitter as the cv input to cross_validate:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()

# Note: random_state only has an effect when shuffle=True
kf = KFold(n_splits=5, shuffle=True, random_state=99)
cv_results = cross_validate(lasso, X, y, cv=kf)
Because kf is seeded with a fixed random_state, re-running the splitter reproduces exactly the splits that cross_validate used, so you can extract the test indices like this:
idx = [test_index for train_index, test_index in kf.split(X)]
The first element of the list is the test index array for the 1st fold, the second for the 2nd fold, and so on.
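You can then pair each fold's test indices with the corresponding score, for example:

for fold, (test_index, score) in enumerate(zip(idx, cv_results["test_score"])):
    print(f"fold {fold}: {len(test_index)} test samples, score {score:.3f}")

On recent scikit-learn releases (1.3+, if I remember correctly) there is also a direct route: pass return_indices=True and cross_validate reports the indices it actually used under cv_results["indices"], so no re-splitting is needed:

cv_results = cross_validate(lasso, X, y, cv=kf, return_indices=True)
test_idx_per_fold = cv_results["indices"]["test"]  # one test index array per fold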

- If there is no direct way to tell `cross_validate` to save the test set indices, I think this is the approach we are left with (i.e. re-running the splitting after the actual CV to obtain the splits). However, there is a slight problem in the code: using random_state without shuffle won't work. – roble Feb 09 '23 at 18:16
- ...added the shuffle argument – roble Feb 09 '23 at 23:04