Should I set shuffle=True in sklearn.model_selection.KFold?
I'm trying to evaluate the cross_val_score of my model on a given dataset. If I write
cross_val_score(estimator=model, X=X, y=y, cv=KFold(shuffle=False), scoring='r2')
I get back:
array([0.39577543, 0.38461982, 0.15859382, 0.3412703 , 0.47607428])
Instead, by setting
cross_val_score(estimator=model, X=X, y=y, cv=KFold(shuffle=True), scoring='r2')
I obtain:
array([0.49701477, 0.53682238, 0.56207702, 0.56805794, 0.61073587])
In light of this, I want to understand whether setting shuffle=True in KFold may lead to over-optimistic cross-validation scores.
Reading the documentation, it says that shuffling happens only once, at the beginning, before the data is split into K folds; the model is then trained on K-1 folds and tested on the one left out, repeating for each fold without re-shuffling. According to this, one shouldn't worry too much. Of course, if the shuffle occurred at each iteration of training during cross-validation, one would end up estimating the generalization error on points that had already been seen during training, which would be a serious mistake; but is this the case?
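One can check this directly by inspecting the folds (a minimal sketch with a toy array standing in for the real data): even with shuffle=True, every sample lands in exactly one test fold, so no test point of a fold was seen during that fold's training.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # toy data, 20 samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Collect the test indices of every fold
test_sets = [set(test) for _, test in kf.split(X)]

# Together, the test folds cover every sample exactly once ...
assert set().union(*test_sets) == set(range(20))
assert sum(len(s) for s in test_sets) == 20

# ... and they are pairwise disjoint, so there is no train/test leakage
for i in range(len(test_sets)):
    for j in range(i + 1, len(test_sets)):
        assert not (test_sets[i] & test_sets[j])
```

So shuffling only changes *which* samples are grouped together in each fold, not the guarantee that train and test sets are disjoint.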
How should I interpret the fact that, in this case, I get slightly better scores when shuffle=True?
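One way to probe whether the gap is meaningful is to repeat the shuffled CV with several random seeds and look at the spread of the resulting scores (a sketch; make_regression and LinearRegression here are hypothetical stand-ins for the question's X, y, and model). If the unshuffled score falls well below the shuffled spread, the data likely has some ordering structure (e.g. sorted by target, or grouped by time) that makes the contiguous, unshuffled folds unrepresentative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins for the question's X, y, and model (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# Mean R^2 over 5 folds, repeated for 10 different shuffles
shuffled = [
    cross_val_score(model, X, y, scoring='r2',
                    cv=KFold(n_splits=5, shuffle=True, random_state=seed)).mean()
    for seed in range(10)
]
unshuffled = cross_val_score(model, X, y, scoring='r2',
                             cv=KFold(n_splits=5, shuffle=False)).mean()

print(f"shuffled:   {np.mean(shuffled):.3f} +/- {np.std(shuffled):.3f}")
print(f"unshuffled: {unshuffled:.3f}")
```

If the shuffled scores themselves vary a lot across seeds, the difference in the question may simply be fold-assignment noise rather than leakage.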