
Could anyone explain the difference between a "normal" k-fold cross-validation with shuffling enabled, e.g.

kf = KFold(n_splits = 5, shuffle = True)

and a repeated k-fold cross-validation? Shouldn't they return the same results?

Having a hard time understanding the difference.

Any hint is appreciated.

JKnow

1 Answer


As its name suggests, RepeatedKFold is simply KFold executed n_repeats times. With n_repeats=1 it behaves exactly like KFold with shuffle=True. The two do not return the same splits in your example because random_state=None by default, i.e. you did not specify it, so each object draws its own seed to (pseudo-)randomly shuffle the data.
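To see the effect of the unset seed in isolation, here is a minimal sketch (the 10-element index array is just an assumed toy input): two shuffled KFold objects without a random_state will almost always produce different folds.

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10)  # 10 dummy samples, purely for illustration

# Neither object is given a random_state, so each draws its own seed;
# the resulting shuffles (and hence the splits) will usually differ.
a = list(KFold(n_splits=5, shuffle=True).split(X))
b = list(KFold(n_splits=5, shuffle=True).split(X))

print("first :", [list(test) for _, test in a])
print("second:", [list(test) for _, test in b])
```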

When both use the same random_state and RepeatedKFold is run with n_repeats=1, they produce identical splits. To see this for yourself, try the following:

import pandas as pd
from sklearn.model_selection import KFold, RepeatedKFold

data = pd.DataFrame([['red', 'strawberry'],  # color, fruit
                     ['red', 'strawberry'],
                     ['red', 'strawberry'],
                     ['red', 'strawberry'],
                     ['red', 'strawberry'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana']])

X = data[0]  # the color column

# KFold
for train_index, test_index in KFold(n_splits=2, shuffle=True, random_state=1).split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

# RepeatedKFold
for train_index, test_index in RepeatedKFold(n_splits=2, n_repeats=1, random_state=1).split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

You should obtain the following:

TRAIN: [1 3 5 7 8] TEST: [0 2 4 6 9]
TRAIN: [0 2 4 6 9] TEST: [1 3 5 7 8]

TRAIN: [1 3 5 7 8] TEST: [0 2 4 6 9]
TRAIN: [0 2 4 6 9] TEST: [1 3 5 7 8]
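With n_repeats > 1 the picture changes: RepeatedKFold reshuffles before each repetition, so you get n_repeats * n_splits train/test pairs rather than the same folds twice. A minimal sketch (using a toy index array rather than the DataFrame above):

```python
from sklearn.model_selection import RepeatedKFold
import numpy as np

X = np.arange(10)  # 10 dummy samples, purely for illustration

# n_splits=2 and n_repeats=2 yields 2 * 2 = 4 train/test pairs;
# each repetition reshuffles, so the second pass generally
# partitions the data differently from the first.
splits = list(RepeatedKFold(n_splits=2, n_repeats=2, random_state=1).split(X))
for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```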
s.dallapalma