Stratifying folds with StratifiedKFold in sklearn

Question

I don't fully understand the logic behind sklearn's train_test_split and StratifiedKFold when it comes to obtaining splits that are balanced across multiple "columns", not only across the target distribution. I know that sentence is a bit obscure, so I hope the following code helps.

import numpy as np
import pandas as pd
import random

n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos

target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)

ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())

This is a 100-example dataset. The target distribution is governed by prob; in this case 20% of the examples are positive. There is also a binary categorical column, cat, which is perfectly balanced. The output of the previous code is:

     target cat        f1        f2
0       0   a  0.970585  0.134268
1       0   a  0.410689  0.225524
2       0   a  0.638111  0.273830
3       0   b  0.594726  0.579668
4       0   a  0.737440  0.667996

With train_test_split(), I stratify on both target and cat and then inspect the frequencies:

from sklearn.model_selection import train_test_split, StratifiedKFold

# frequencies in the full dataset
print("* dataset")
print(ds["target"].value_counts() / len(ds))
print(ds[["target", "cat"]].value_counts() / len(ds))

# with train_test_split, stratifying on both columns at once
training, valid = train_test_split(
    range(n_samples),
    test_size=20,
    stratify=ds[["target", "cat"]],
)

print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training))  # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid))  # balanced

we get this:

* dataset
0    0.8
1    0.2
Name: target, dtype: float64
target  cat
0       a      0.4
        b      0.4
1       a      0.1
        b      0.1
dtype: float64
---
* training
target  cat
0       a      0.4
        b      0.4
1       a      0.1
        b      0.1
dtype: float64
* validation
target  cat
0       a      0.4
        b      0.4
1       a      0.1
        b      0.1
dtype: float64

It is perfectly stratified.

Now with StratifiedKFold:

# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
    for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
        pass
except ValueError:  # StratifiedKFold rejects a two-column y
    print("! does not work")


for train, valid in skf.split(X=range(len(ds)), y=ds.target):
    print("happily iterating")

output:

! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating

How do I obtain what I got with train_test_split with StratifiedKFold? I know there might be data distributions not allowing such stratifications in k-fold cross validation, but I cannot understand why train_test_split accepts two or more columns and the other method does not.

score 1 · Accepted Answer · answered Mar 24 '22 at 15:12


This doesn't seem readily possible currently.

Multilabel stratification isn't exactly what you're looking for, but it is related. That has been asked here before, and there was an issue about it on sklearn's GitHub (not sure why it got closed).

As a bit of a hack, you should be able to combine your two columns into a new one holding the ordered pairs, and stratify on that.
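A minimal sketch of that workaround, under a few assumptions: the column name strat_key is made up for illustration, and the toy data is built deterministically (rather than with random.shuffle as in the question) so that every (target, cat) combination has a count divisible by the number of folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data with the joint proportions shown in the question:
# 40 x (0, a), 40 x (0, b), 10 x (1, a), 10 x (1, b).
ds = pd.DataFrame({
    "target": [0] * 80 + [1] * 20,
    "cat": ["a"] * 40 + ["b"] * 40 + ["a"] * 10 + ["b"] * 10,
})
ds = ds.sample(frac=1, random_state=0).reset_index(drop=True)

# Hypothetical helper column: one label per (target, cat) pair.
ds["strat_key"] = ds["target"].astype(str) + "_" + ds["cat"].astype(str)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_proportions = []
# X is only used for its length here, so a placeholder array is enough.
for train_idx, valid_idx in skf.split(X=np.zeros(len(ds)), y=ds["strat_key"]):
    valid = ds.iloc[valid_idx]
    # Joint (target, cat) frequencies inside this validation fold.
    props = valid[["target", "cat"]].value_counts(normalize=True).sort_index()
    fold_proportions.append(props)
    print(props.to_dict())
```

The combined key turns the problem into ordinary single-column multiclass stratification, which StratifiedKFold accepts. The trade-off is that every combination becomes its own class, so a rare combination with fewer rows than n_splits cannot appear in every fold.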


Ben Reiniger


Thanks for your answer. The combination of the columns in a new one was the solution I adopted, but I was hoping my understanding of `StratifiedKFold` was wrong. – Antonio Sesto Apr 01 '22 at 09:43
