How can 1:1 stratified sampling be performed in Python?

Assume the pandas DataFrame df is heavily imbalanced. It contains a binary group column and several columns of categorical sub-groups.

import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5], 'group':[0,1,0,1,0], 'sub_category_1':[1,2,2,1,1], 'sub_category_2':[1,2,2,1,1], 'value':[1,2,3,1,2]})
# display() is available in Jupyter/IPython; use print() elsewhere
display(df)
display(df[df.group == 1])
display(df[df.group == 0])
df.group.value_counts()

For each row with group==1 I need to find a single matching row with group==0 that has the same sub-category values.

A StratifiedShuffleSplit from scikit-learn will only return a random portion of data, not a 1:1 match.
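
To make the requirement concrete, here is a minimal sketch that tabulates how many rows each group has per sub-category combination. The names `strata`, `counts`, and `short` are illustrative and not part of the original question. It only checks whether a 1:1 match without replacement is possible; it does not perform the matching.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'group': [0, 1, 0, 1, 0],
    'sub_category_1': [1, 2, 2, 1, 1],
    'sub_category_2': [1, 2, 2, 1, 1],
    'value': [1, 2, 3, 1, 2],
})

# Columns that define a stratum (about 10 in the real dataset).
strata = ['sub_category_1', 'sub_category_2']

# One row per sub-category combination, one column per group value (0 and 1).
counts = (
    df.groupby(strata + ['group'])
      .size()
      .unstack('group', fill_value=0)
)
print(counts)

# Strata in which group 0 has fewer rows than group 1, so a distinct
# match for every group 1 row is not available there.
short = counts[counts[0] < counts[1]]
print(short)
```

If `short` is empty, every group 1 row can in principle receive its own group 0 match within the same stratum.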

Georg Heiler

1 Answer

If I understood correctly you could use np.random.permutation:

import numpy as np
import pandas as pd

np.random.seed(42)

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0], 'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})

# create new column with an identifier for a combination of categories
columns = ['sub_category_1', 'sub_category_2']
# join with a separator so that e.g. (1, 12) and (11, 2) do not collide
labels = df.loc[:, columns].apply(lambda x: '_'.join(map(str, x.values)), axis=1)
# pd.factorize returns (codes, uniques); the codes are the per-row integer labels
df['label'], _ = pd.factorize(labels)

# build distribution of sub-categories combinations
distribution = df[df.group == 1].label.value_counts().to_dict()

# select from group 0 only those rows that are in the same sub-categories combinations
mask = (df.group == 0) & (df.label.isin(distribution))

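# do random sampling; group by the plain column name so that name is a scalar key,
# and concatenate so that strata of different sizes are handled
selected = np.concatenate([np.random.permutation(group.index)[:distribution[name]] for name, group in df.loc[mask].groupby('label')])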

# display result (selected holds index labels, so use .loc rather than .iloc)
result = df.drop('label', axis=1).loc[selected]
print(result)

Output

   group  id  sub_category_1  sub_category_2  value
4      0   5               1               1      2
2      0   3               2               2      3

Note that this solution assumes that, for each sub_category combination, group 1 has no more rows than the corresponding sub-group in group 0. Otherwise the slice simply returns fewer rows than requested and the match is no longer 1:1. A more robust version uses np.random.choice with replacement:

selected = np.concatenate([np.random.choice(group.index, distribution[name], replace=True) for name, group in df.loc[mask].groupby('label')])

The version with choice does not rely on that assumption, although it still requires at least one group 0 row for each sub-category combination. Because it samples with replacement, the same group 0 row may be selected more than once.
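
As a possible way to combine both variants, the sketch below samples without replacement where a stratum has enough group 0 rows and falls back to replacement only where it does not. The function name `match_one_to_one` and its parameters are hypothetical, introduced only for illustration. It assumes a unique index and uses `np.random.default_rng` rather than the global seed used above, so it will not necessarily reproduce the output shown earlier. Strata with no group 0 rows at all are left unmatched.

```python
import numpy as np
import pandas as pd


def match_one_to_one(df, strata, group_col='group', seed=None):
    """Sample one group-0 row for every group-1 row within each stratum."""
    rng = np.random.default_rng(seed)

    # Integer id per combination of the stratum columns.
    stratum = df.groupby(strata).ngroup()

    # Number of group-1 rows per stratum, i.e. matches needed.
    needed = stratum[df[group_col] == 1].value_counts()

    controls = df[df[group_col] == 0]
    picked = []
    for sid, pool in controls.groupby(stratum.loc[controls.index]):
        n = int(needed.get(sid, 0))
        if n == 0:
            continue  # no group-1 rows in this stratum
        # Sample with replacement only when the pool is too small.
        replace = len(pool) < n
        picked.append(rng.choice(pool.index.to_numpy(), size=n, replace=replace))

    if not picked:
        return df.iloc[0:0]  # empty frame with the same columns
    return df.loc[np.concatenate(picked)]


df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1],
                   'value': [1, 2, 3, 1, 2]})

matched = match_one_to_one(df, ['sub_category_1', 'sub_category_2'], seed=42)
print(matched)
```

Grouping on the columns directly avoids building a string label, which may matter with around 10 sub-category columns.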

Dani Mesejo
  • Your assumption that group 1 is smaller than group 0 is fine. However, you only calculate a random subsample for group 0. I also need to take all the sub-categories into account. I initially concatenated all columns, i.e. `group_sub_category_1_sub_category_2`, to generate new classes but, as mentioned, got stuck with the `StratifiedShuffleSplit`. – Georg Heiler Feb 12 '19 at 15:35
  • Indeed. For both `sub_category_1` and `sub_category_2` (in fact for the real dataset it is about 10 columns). – Georg Heiler Feb 12 '19 at 15:37
  • This should be the case as well. But if possible it would be great if it is resilient enough to also handle the case if they are equal or smaller. But the first case would already be great. – Georg Heiler Feb 12 '19 at 15:39
  • @GeorgHeiler Updated the answer! – Dani Mesejo Feb 12 '19 at 16:09