
Should I set shuffle=True in sklearn.model_selection.KFold?

I'm trying to evaluate the cross_val_score of my model on a given dataset.

If I write

from sklearn.model_selection import KFold, cross_val_score

cross_val_score(estimator=model, X=X, y=y, cv=KFold(shuffle=False), scoring='r2')

I get back:

array([0.39577543, 0.38461982, 0.15859382, 0.3412703 , 0.47607428])

Instead, with

cross_val_score(estimator=model, X=X, y=y, cv=KFold(shuffle=True), scoring='r2')

I obtain:

array([0.49701477, 0.53682238, 0.56207702, 0.56805794, 0.61073587])

So, in light of this, I want to understand whether setting shuffle=True in KFold may lead to obtaining over-optimistic cross-validation scores.

Reading the documentation, it says that shuffling happens only once, at the beginning, before the data is split into K folds; each iteration then trains on K-1 folds and tests on the one left out, without re-shuffling between iterations. According to this, one shouldn't worry too much. Of course, if the shuffle occurred at every iteration during cross-validation, the held-out fold could contain points that were previously seen during training, which would be a serious mistake, but is this the case?
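One way to check this directly is to inspect the index sets that KFold produces (a minimal sketch on a toy array, with random_state fixed only for reproducibility): even with shuffle=True, the train and test indices of each split are disjoint, and every sample lands in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # toy data; only the row indices matter here

kf = KFold(n_splits=5, shuffle=True, random_state=0)

all_test = []
for train_idx, test_idx in kf.split(X):
    # within each split, train and test never overlap
    assert set(train_idx).isdisjoint(test_idx)
    all_test.extend(test_idx)

# across the 5 splits, every sample appears in a test fold exactly once,
# so the single upfront shuffle cannot leak training points into held-out folds
assert sorted(all_test) == list(range(20))
```

So the shuffle only changes *which* samples end up grouped together in a fold, not whether held-out data was seen during training.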

How should I interpret the fact that, in this case, I get noticeably better values when shuffle is True?

