
sklearn's train_test_split, StratifiedShuffleSplit and StratifiedKFold all stratify based on class labels (the y-variable or target column). What if we want to sample based on the feature columns (x-variables) instead of the target column? If it were just one feature, it would be easy to stratify on that single column, but what if there are many feature columns and we want to preserve the population's proportions in the selected sample?

Below I create a df with a skewed population: more low-income people, more females, the fewest people from CA and the most from MA. I want the selected sample to share these characteristics, i.e. more low-income people, more females, fewest from CA and most from MA.

import random
import string
import pandas as pd
N = 20000 # Total rows in data
names    = [''.join(random.choices(string.ascii_uppercase, k = 5)) for _ in range(N)]
incomes  = [random.choices(['High','Low'], weights=(30, 70))[0] for _ in range(N)]
genders  = [random.choices(['M','F'], weights=(40, 60))[0] for _ in range(N)]
states   = [random.choices(['CA','IL','FL','MA'], weights=(10,20,30,40))[0] for _ in range(N)]
targets_y= [random.choice([0,1]) for _ in range(N)]

df = pd.DataFrame(dict(
        name     = names,
        income   = incomes,
        gender   = genders,
        state    = states,
        target_y = targets_y
    ))
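For comparison, a common workaround using sklearn itself is to concatenate the feature columns into a single composite key and pass that to train_test_split's stratify argument. A minimal self-contained sketch (the column names and weights mirror the df above; this is an alternative approach, and it requires every key combination to have at least two rows):

```python
import random
import pandas as pd
from sklearn.model_selection import train_test_split

random.seed(0)
N = 1000
df = pd.DataFrame({
    'income': [random.choices(['High', 'Low'], weights=(30, 70))[0] for _ in range(N)],
    'gender': [random.choices(['M', 'F'], weights=(40, 60))[0] for _ in range(N)],
})

# Combine the feature columns into one composite stratification key
key = df['income'] + '_' + df['gender']

train, test = train_test_split(df, test_size=0.2, stratify=key, random_state=0)
```

This preserves the joint proportions of the combined columns, but it cannot guarantee a minimum number of rows per group, which is where the groupby approach below comes in.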

A further complexity arises when, for some of the characteristics, we have very few examples and we want to include at least n examples in the selected sample. Consider this example:

single_row = {'name'     : 'ABC',
              'income'   : 'High',
              'gender'   : 'F',
              'state'    : 'NY',
              'target_y' : 1}

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df = pd.concat([df, pd.DataFrame([single_row])], ignore_index=True)

df


I want this single added row to always be included in the test split (n = 1 here).

Abhi25t

1 Answer


This can be achieved using pandas groupby:

Let us first check the population characteristics:

grps = df.groupby(['state','income','gender'], group_keys=False)
grps.count()
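To compare against the target proportions directly, it can help to normalize the group sizes into fractions of the population. A small self-contained sketch (using synthetic data with the same weights as the question's df):

```python
import random
import pandas as pd

random.seed(0)
N = 5000
df = pd.DataFrame({
    'state':  [random.choices(['CA', 'IL', 'FL', 'MA'], weights=(10, 20, 30, 40))[0] for _ in range(N)],
    'income': [random.choices(['High', 'Low'], weights=(30, 70))[0] for _ in range(N)],
})

# Fraction of the population falling in each (state, income) cell
props = df.groupby(['state', 'income']).size() / len(df)
print(props)
```

The fractions sum to 1 and make the skew (most MA, fewest CA, mostly low income) easy to read off.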


Next, let's create a test set with 20% of the original data:

test_proportion = 0.2
at_least = 1
test = grps.apply(lambda x: x.sample(max(round(len(x)*test_proportion), at_least)))
test


test-set characteristics:

test.groupby(['state','income','gender']).count()

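To check that the sample actually mirrors the population, we can put the normalized group sizes of both side by side. A self-contained sketch of that comparison (synthetic data, two feature columns for brevity):

```python
import random
import pandas as pd

random.seed(0)
N = 5000
df = pd.DataFrame({
    'income': [random.choices(['High', 'Low'], weights=(30, 70))[0] for _ in range(N)],
    'gender': [random.choices(['M', 'F'], weights=(40, 60))[0] for _ in range(N)],
})

# Same group-wise sampling as above
grps = df.groupby(['income', 'gender'], group_keys=False)
test = grps.apply(lambda x: x.sample(max(round(len(x) * 0.2), 1)))

# Population vs sample proportions per group
compare = pd.DataFrame({
    'population': df.groupby(['income', 'gender']).size() / len(df),
    'sample':     test.groupby(['income', 'gender']).size() / len(test),
})
print(compare)
```

The two columns should agree up to rounding, since each group contributes (approximately) the same 20% share.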

Next, we create the train set as the difference between the original df and the test set:

print('No. of samples in test  =', len(test))
# Drop by index rather than by name: the random 5-letter names
# are not guaranteed unique, so a set of names could silently
# collapse duplicates.
train = df.drop(test.index)
print('No. of samples in train =', len(train))

No. of samples in test = 4000

No. of samples in train = 16001
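Because at_least = 1, any group with a single member is always sampled into the test set, which is what guarantees the manually added NY row ends up there. A minimal self-contained check of that behavior (tiny hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['A', 'B', 'C', 'D', 'ABC'],
    'state':  ['MA', 'MA', 'MA', 'MA', 'NY'],
    'income': ['Low', 'Low', 'Low', 'Low', 'High'],
})

grps = df.groupby(['state', 'income'], group_keys=False)
# Each group contributes max(round(20% of its size), 1) rows
test = grps.apply(lambda x: x.sample(max(round(len(x) * 0.2), 1)))

# The lone NY row forms a group of size 1, so it is always selected
assert 'ABC' in test['name'].values
```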
