train test data split using stratify on two columns in scikit-learn

Question

I have a dataset that I want to split into train and test sets so that the test set contains data from each data source (specified in column "source") and from each class (specified in column "class"). I read about using the stratify parameter of scikit-learn's train_test_split function, but how can I use it on two columns?

you need to write your own wrapper for this, currently this functionality is not available in sklearn. — Venkatachalam, Mar 10 '20 at 11:07

Sergey Bushmanov · Answer 1 · 2020-03-10T22:50:30.923

Stratifying on multiple columns is easily done with sklearn's train_test_split since version 0.19.0: pass a 2-D array holding those columns as stratify.

Proof

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(n_samples=1000000, n_features=10, n_classes=2, n_labels=1)
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, stratify=Y, train_size=.8, random_state=42)
Y.shape

(1000000, 2)
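
As far as I can tell from the implementation, a 2-D stratify array is handled by treating each distinct row (each combination of labels) as its own stratum, so the share of every combination should be close to identical in train and test. Here is a minimal sketch of that check. It assumes a smaller sample of 10,000 rows so it runs quickly, and combo_shares is a hypothetical helper written for this illustration, not part of sklearn.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split

# Assumption: 10,000 rows instead of 1,000,000, and fixed seeds for repeatability.
X, Y = make_multilabel_classification(
    n_samples=10_000, n_features=10, n_classes=2, n_labels=1, random_state=0
)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, Y, stratify=Y, train_size=0.8, random_state=42
)


def combo_shares(labels):
    """Hypothetical helper: map each distinct label row to its share of rows."""
    combos, counts = np.unique(labels, axis=0, return_counts=True)
    return {
        tuple(int(v) for v in combo): count / len(labels)
        for combo, count in zip(combos, counts)
    }


train_shares = combo_shares(train_Y)
test_shares = combo_shares(test_Y)

# Print each label combination with its share in train and in test.
for combo in sorted(train_shares):
    print(combo, round(train_shares[combo], 4), round(test_shares[combo], 4))
```

If the two printed shares agree for every combination, the split preserved the joint distribution of the two columns and not only their marginals.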

Then compare the simple column means of the resulting train and test sets:

train_Y[:,0].mean(), test_Y[:,0].mean()
(0.45422, 0.45422)

train_Y[:,1].mean(), test_Y[:,1].mean()
(0.23472375, 0.234725)

Run a t-test for equality of the means:

from scipy.stats import ttest_ind
ttest_ind(train_Y[:,0],test_Y[:,0])

Ttest_indResult(statistic=0.0, pvalue=1.0)

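Finally, do the same for the conditional means (the mean of the second column among rows where the first column is 1) to confirm that the split preserved the joint distribution, which is what you wanted: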

train_Y[train_Y[:,0].astype("bool"),1].mean(), test_Y[test_Y[:,0].astype("bool"),1].mean()
(0.43959149751221877, 0.43958874554180793)
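
Applied to the setup in the question, the same idea might look like the sketch below. The DataFrame contents, the "feature" column, and the group sizes are made up for illustration; only the column names "source" and "class" come from the question. The two columns are converted to a NumPy array before being passed as stratify, which is a cautious choice on my part and not something the split necessarily requires.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two sources, two classes, unequal group sizes.
df = pd.DataFrame({
    "source": ["a"] * 60 + ["b"] * 40,
    "class": [0] * 40 + [1] * 20 + [0] * 30 + [1] * 10,
    "feature": range(100),
})

# Stratify on the (source, class) pairs by passing both columns as a 2-D array.
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df[["source", "class"]].to_numpy(),
    random_state=42,
)

# Count how many rows of each (source, class) pair ended up in the test set.
print(test_df.groupby(["source", "class"]).size())
```

Every (source, class) pair should show up in test_df in roughly the same proportion as in df. Note that train_test_split raises an error if any pair occurs only once, since a single row cannot be placed in both sets.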
