I do not understand very well the logic behind sklearn
function train_test_split
and StratifiedKFold
for obtaining balanced splits according to multiple "columns" and not only according to the target distribution. I know the previous sentence is a bit obscure so I hope the following code helps.
import numpy as np
import pandas as pd
import random
n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos
target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)
ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())
This is a 100-example dataset, target distribution is governed by p
, in this case we have 20% positive examples. There is a binary categorical column cat
, perfectly balanced. The output of the previous code is:
target cat f1 f2
0 0 a 0.970585 0.134268
1 0 a 0.410689 0.225524
2 0 a 0.638111 0.273830
3 0 b 0.594726 0.579668
4 0 a 0.737440 0.667996
with train_test_split()
, stratify
on target
and cat
, if we study the frequencies, we get:
from sklearn.model_selection import train_test_split, StratifiedKFold
# with train_test_split
training, valid = train_test_split(range(n_samples),
test_size=20,
stratify=ds[["target", "cat"]])
print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training)) # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid)) # balanced
we get this:
* dataset
0 0.8
1 0.2
Name: target, dtype: float64
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
---
* training
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
* validation
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
It is perfectly stratified.
Now with StratifiedKFold
:
# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
pass
except:
print("! does not work")
for train, valid in skf.split(X=range(len(ds)), y=ds.target):
print("happily iterating")
output:
! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating
How do I obtain what I got with train_test_split
with StratifiedKFold
? I know there might be data distributions not allowing such stratifications in k-fold cross validation, but I cannot understand why train_test_split
accepts two or more columns and the other method does not.