3

I'm trying to understand how to use the cross-validation function sklearn.model_selection.KFold. If I define (like in this tutorial)

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False, random_state=100)

I get

ValueError: Setting a random_state has no effect since shuffle is False.
You should leave random_state to its default (None), or set shuffle=True. 

What does this error mean and why is it necessary to set random_state=None or shuffle=True?

desertnaut
Medulla Oblongata

2 Answers

5

Shuffling in this context means that the samples are randomly reordered before being split into train/test folds. The random_state seeds that reordering so the same folds are produced on every run. With shuffle=False nothing is randomized, so random_state has nothing to control, and that is what the error is telling you.
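To make the point above concrete, here is a small sketch. The 10-element toy array and the variable names are assumptions chosen only for illustration; the KFold arguments are the ones discussed in the question.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # toy data: 10 samples, illustrative only

# shuffle=False (the default): test folds are consecutive blocks of indices.
unshuffled = [test.tolist() for _, test in KFold(n_splits=5).split(X)]

# shuffle=True: random_state seeds the permutation, so two splitters
# created with the same seed produce identical folds.
first = [test.tolist()
         for _, test in KFold(n_splits=5, shuffle=True, random_state=100).split(X)]
second = [test.tolist()
          for _, test in KFold(n_splits=5, shuffle=True, random_state=100).split(X)]

print(unshuffled)       # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print(first == second)  # True
```

Without shuffling there is no randomness for a seed to fix, which is why passing random_state together with shuffle=False is rejected.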

desertnaut
Taylrl
  • Thanks, what references do you recommend for learning more on data preprocessing? I don't find the sklearn docs that helpful. – Medulla Oblongata Jun 28 '21 at 20:57
  • That's the correct answer indeed, although I confess I am puzzled why the sklearn designers decided to throw an error in this case; arguably a warning would be more than enough. – desertnaut Jun 29 '21 at 09:40
  • Thanks! This worked for me. kfold = KFold(n_splits=10, random_state=10, shuffle=True) – Sanushi Salgado Apr 09 '22 at 11:04
3

By default, KFold uses shuffle=False. If you set random_state to a value, you must also enable shuffling with shuffle=True; the call then works.

Example:

from sklearn import model_selection
k_fold = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)
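As a rough check of which argument combinations are accepted, the sketch below tries each one. The helper try_kfold is hypothetical (not part of scikit-learn), and it assumes a scikit-learn version that raises the ValueError quoted in the question; older releases may only have issued a warning.

```python
from sklearn.model_selection import KFold


def try_kfold(**kwargs):
    """Hypothetical helper: return 'ok' if KFold accepts the arguments,
    otherwise the name of the exception it raised."""
    try:
        KFold(n_splits=5, **kwargs)
        return "ok"
    except ValueError as exc:
        return type(exc).__name__


# Seed with shuffling: valid, and the folds are reproducible.
print(try_kfold(shuffle=True, random_state=100))
# No shuffling and no seed: valid, the folds are consecutive blocks.
print(try_kfold(shuffle=False))
# Seed without shuffling: rejected, as in the question.
print(try_kfold(shuffle=False, random_state=100))
```

So the two ways out are the ones the error message names: drop random_state, or keep it and set shuffle=True.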
RiveN