Train test split for ensuring all categories are included in train set

Question

Let's say there are some 20 categorical columns in the data, each having a different set of unique categorical values. Now a train test split has to be done, and one needs to ensure that all unique categories are included in the train set. How can this be done? I have not tried it yet, but should all these columns be passed to the stratify argument?

score 3 · Answer 1 · answered Dec 06 '20 at 07:01

Yes. That's correct.

For demonstration, I'm using the Melbourne Housing dataset and stratifying on two of its categorical columns, Method and Type.

import pandas as pd
from sklearn.model_selection import train_test_split

Meta = pd.read_csv('melb_data.csv')
Meta = Meta[["Rooms", "Type", "Method", "Bathroom"]]
print(Meta.head())

print("\nBefore split -- Method feature distribution\n")
print(Meta.Method.value_counts(normalize=True))
print("\nBefore split -- Type feature distribution\n")
print(Meta.Type.value_counts(normalize=True))

train, test = train_test_split(Meta, test_size = 0.2, stratify=Meta[["Method", "Type"]])

print("\nAfter split -- Method feature distribution\n")
print(train.Method.value_counts(normalize=True))
print("\nAfter split -- Type feature distribution\n")
print(train.Type.value_counts(normalize=True))

Output

   Rooms Type Method  Bathroom
0      2    h      S       1.0
1      2    h      S       1.0
2      3    h     SP       2.0
3      3    h     PI       2.0
4      4    h     VB       1.0

Before split -- Method feature distribution

S     0.664359
SP    0.125405
PI    0.115169
VB    0.088292
SA    0.006775
Name: Method, dtype: float64

Before split -- Type feature distribution

h    0.695803
u    0.222165
t    0.082032
Name: Type, dtype: float64

After split -- Method feature distribution

S     0.664396
SP    0.125368
PI    0.115151
VB    0.088273
SA    0.006811
Name: Method, dtype: float64

After split -- Type feature distribution

h    0.695784
u    0.222202
t    0.082014
Name: Type, dtype: float64
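
Matching proportions suggest the categories are all present, but it is easy to check directly. Here is a minimal sketch of such a check. The toy DataFrame and the helper name missing_categories are made up for illustration and are not part of the dataset or of scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Method and Type columns above.
df = pd.DataFrame({
    "Method": ["S", "S", "SP", "SP", "PI", "PI", "S", "S", "SP", "PI"] * 4,
    "Type":   ["h", "u", "h", "u", "h", "u", "h", "u", "h", "u"] * 4,
})


def missing_categories(full, train, columns):
    """Return {column: values present in `full` but absent from `train`}."""
    missing = {}
    for col in columns:
        absent = set(full[col].unique()) - set(train[col].unique())
        if absent:
            missing[col] = absent
    return missing


# Same call pattern as in the answer: stratify on both columns at once.
train, test = train_test_split(
    df, test_size=0.2, stratify=df[["Method", "Type"]], random_state=0
)

# An empty dict means every category of every listed column is in train.
missing = missing_categories(df, train, ["Method", "Type"])
print(missing)
```

If the dict is not empty, it tells you which values of which column never made it into the train set.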

score 0 · Answer 2 · answered Nov 15 '21 at 12:09

You want every category of every categorical variable to be in your train split.

Using:

train, test = train_test_split(Meta, test_size = 0.2, stratify=Meta[["Method", "Type"]])

ensures that all categories appear in both the train split and the test split. This is more than you need.

Note that the more categorical variables you stratify on, the more likely it is that some combination of categories has only one record. When that happens, the split fails.

Error message:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
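
If only the train set has to cover every category, one way around this error is to skip stratify altogether: reserve one row per category value for train, then split the remaining rows at random. Below is a rough sketch of that idea, not a scikit-learn feature. The helper name split_with_all_categories and the toy data are hypothetical, and the sketch assumes the DataFrame index is unique.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_all_categories(df, cat_cols, test_size=0.2, random_state=None):
    """Hypothetical helper: every value of every column in `cat_cols` ends up in train.

    Assumes `df` has a unique index.
    """
    # First occurrence of each distinct value, per column; these rows go to train.
    forced = set()
    for col in cat_cols:
        forced.update(df.drop_duplicates(subset=col).index)
    forced = sorted(forced)

    rest = df.drop(index=forced)
    # Size the test set relative to the full frame, not just the leftover rows.
    n_test = int(round(test_size * len(df)))
    if not 0 < n_test < len(rest):
        raise ValueError("not enough rows left after reserving one per category")

    rest_train, test = train_test_split(
        rest, test_size=n_test, random_state=random_state
    )
    train = pd.concat([df.loc[forced], rest_train])
    return train, test


# Toy data: the combination (SA, t) occurs once, so stratifying on both columns would fail.
df = pd.DataFrame({
    "Method": ["S"] * 10 + ["SP"] * 5 + ["SA"],
    "Type":   ["h"] * 8 + ["u"] * 7 + ["t"],
})

# Combinations with a single record are the ones that trigger the ValueError.
combo_sizes = df.groupby(["Method", "Type"]).size()
singletons = combo_sizes[combo_sizes < 2]
print(singletons)

train, test = split_with_all_categories(df, ["Method", "Type"], random_state=0)
print(sorted(train["Method"].unique()), sorted(train["Type"].unique()))
```

The trade-off is that the reserved rows are not chosen at random, so the train distribution can drift slightly from the full data when many categories are rare. The test set may also be missing some categories, which is fine for the requirement in the question.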
