I use StratifiedKFold and a form of grid search for my Logistic Regression.

skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)

I call this for loop for each combination of parameters:

for fold, (trn_idx, test_idx) in enumerate(skf.split(X, y)):

My question is, are trn_idx and test_idx the same for each fold every time I run the loop?

For example, if fold0 contains trn_idx = [1,2,5,7,8] and test_idx = [3,4,6], is fold0 going to contain the same trn_idx and test_idx the next 5 times I run the loop?

Yana

1 Answer

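Yes, the stratified k-fold split is fixed as long as random_state is set to a fixed integer (here random_state=SEED). With shuffle=True, the samples of each class are shuffled before being divided into folds, and the integer seed makes that shuffle reproducible on every call to split.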

This means that each fold will always contain the same indices:


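from sklearn.model_selection import StratifiedKFold

# Toy data: 10 samples, two classes with 5 samples each
x = list(range(10))
y = [1]*5 + [2]*5

# An integer random_state makes the shuffled split reproducible
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Print the train and test indices of each fold
for fold, (trn_idx, test_idx) in enumerate(skf.split(x, y)):
    print(trn_idx, test_idx)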

Output:

[1 2 4 5 7 9] [0 3 6 8]
[0 1 3 5 6 8 9] [2 4 7]
[0 2 3 4 6 7 8] [1 5 9]

The output is the same no matter how many times I run this code.
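To check this for the grid-search case in the question, here is a minimal sketch that calls split twice on the same splitter and compares the results. The data, the SEED value, the collect_splits helper, and the parameter grid are all made up for illustration; only StratifiedKFold and its split method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

SEED = 42  # hypothetical seed, any fixed integer works

# Hypothetical toy data: 20 samples, two balanced classes
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)


def collect_splits(splitter, X, y):
    """Return the (train, test) indices of every fold as plain lists."""
    return [(trn.tolist(), tst.tolist()) for trn, tst in splitter.split(X, y)]


# Two independent passes over the same splitter, as in a grid-search loop
first_pass = collect_splits(skf, X, y)
second_pass = collect_splits(skf, X, y)
same_across_calls = first_pass == second_pass
print("identical across calls:", same_across_calls)

# Optional: materialize the folds once and reuse them for every
# parameter combination, which makes the reuse explicit.
folds = list(skf.split(X, y))
for params in [{"C": 0.1}, {"C": 1.0}]:  # hypothetical grid
    for fold, (trn_idx, test_idx) in enumerate(folds):
        pass  # fit and evaluate the model on this fold here
```

Note that this holds for an integer random_state. According to the scikit-learn documentation, passing random_state=None or a RandomState instance with shuffle=True can give different splits on each call to split, so storing the folds in a list as above is a safe way to guarantee that every parameter combination sees the same folds.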

SystemSigma_