I was given two already-filtered datasets, one with about 540 rows and the other with about 880 rows. The request is to split each into 5-6 equally sized groups that are balanced (stratified) across three variables. The relevant columns are:
- AreaID - unique ID for geographic region
- Region - one of four region names (A, B, C, D); one region in particular is underrepresented
- Decile - an already calculated decile ranking based on another variable (range 1-10)
- Population - population of the geographic region (ranges from 0 to over a million)
- PopGroup - four population bins I created (0-50k, 50k-100k, 100k-1M, 1M+)
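For reference, a minimal sketch of how bins like PopGroup can be built with pd.cut. The toy data and the boundary handling (right=False, so each bin includes its lower edge) are assumptions made for illustration, not necessarily how the real column was built.

```python
import pandas as pd

# Hypothetical toy data; the real datasets have about 540 and 880 rows.
df = pd.DataFrame(
    {"Population": [0, 12_000, 50_000, 75_000, 100_000, 640_000, 1_000_000, 2_300_000]}
)

# Bin edges follow the four labels above. right=False makes each bin
# [lower, upper), which is an assumption about how boundary values are treated.
bins = [0, 50_000, 100_000, 1_000_000, float("inf")]
labels = ["0-50k", "50k-100k", "100k-1M", "1M+"]
df["PopGroup"] = pd.cut(df["Population"], bins=bins, labels=labels, right=False)

# Bin sizes, in label order, to see how imbalanced the cutoffs are.
print(df["PopGroup"].value_counts(sort=False))
```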
This isn't for machine learning. I need 5-6 groups in which deciles, regions, and population are equally distributed. I added PopGroup because previous threads stated that continuous variables should be binned before stratifying. With my bin cutoffs, the four bins are clearly imbalanced. I then created a stratifying variable by concatenating PopGroup and Region. Adding Decile to the stratifying variable would create even more strata on an already small dataset, so I left it out and hoped that the mostly uniform distribution of deciles would split evenly by chance. As it is, I had to drop one whole region (n=6) and one region_popgroup label from the respective datasets because of errors about a stratum having only one sample.
df['region_popgroup'] = df['Region'] + "_" + df['PopGroup'].astype(str)
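The strata that cause trouble can be listed before splitting. The sketch below uses hypothetical toy data and an assumed n_groups variable; it flags every stratum with fewer rows than the number of groups, since such a stratum cannot contribute a row to every group.

```python
import pandas as pd

# Hypothetical toy data standing in for the real frame.
df = pd.DataFrame({
    "Region": ["A"] * 6 + ["B"] * 5 + ["D"] * 1,
    "PopGroup": ["0-50k"] * 6 + ["50k-100k"] * 5 + ["1M+"] * 1,
})
df["region_popgroup"] = df["Region"] + "_" + df["PopGroup"].astype(str)

n_groups = 5  # assumed number of groups requested

# Count rows per stratum and keep the strata that are too small
# to place at least one row in each group.
counts = df["region_popgroup"].value_counts()
rare = counts[counts < n_groups].index.tolist()
print(rare)
```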
After looking at the various splitting methods (StratifiedShuffleSplit, StratifiedKFold, GroupShuffleSplit, GroupKFold, ShuffleSplit, MultilabelStratifiedKFold), I went with chained calls to train_test_split:
import pandas as pd
from sklearn.model_selection import train_test_split

# Each step peels off one fifth of the original rows: 1/5, then 1/4, 1/3 and 1/2 of what remains.
train1, test1 = train_test_split(df, test_size=0.2, random_state=42, stratify=df['region_popgroup'])
train2, test2 = train_test_split(train1, test_size=0.25, random_state=42, stratify=train1['region_popgroup'])
train3, test3 = train_test_split(train2, test_size=1/3, random_state=42, stratify=train2['region_popgroup'])
test4, test5 = train_test_split(train3, test_size=0.5, random_state=42, stratify=train3['region_popgroup'])
testgroups = pd.concat([test1, test2, test3, test4, test5], axis=0)
I changed test_size at each step, using the leftover train set from one split as the source for the next, so that I'd end up with 5 similarly sized groups. The deciles weren't always equally distributed, but they were very even apart from a couple of decile labels. I think the regions and population groups ended up good enough.
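One way to quantify "very even" is a cross-tabulation of decile against group. This is only a sketch on hypothetical toy data, and the group column (an integer label per row) is an assumption, since the concatenated frame above does not carry one.

```python
import pandas as pd

# Hypothetical toy data; 'group' is an assumed column holding the assigned group.
df = pd.DataFrame({
    "group": [1, 1, 2, 2, 3, 3],
    "Decile": [1, 2, 1, 2, 1, 2],
})

# Rows are deciles, columns are groups, cells are row counts.
decile_by_group = pd.crosstab(df["Decile"], df["group"])

# Difference between the fullest and emptiest group for each decile;
# 0 means that decile is spread perfectly evenly.
spread = decile_by_group.max(axis=1) - decile_by_group.min(axis=1)
print(decile_by_group)
print(spread)
```

The same table can be built for Region or PopGroup by swapping the first argument.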
Is there a better solution? Currently I have to drop the drastically underrepresented groups, which then need to be assigned manually later; I'd prefer a solution that eliminates the manual assignment. Also, if 6 groups are really preferred, I would have to add a first split with test_size=0.167. So, is there a method that lets me choose the number of groups/folds (i.e. n_splits=5) and compare the results? And is there a better way to handle population size: is there a method that allows stratification on a continuous variable, or should I bin the populations differently? I thought this would be more straightforward. I appreciate any help. Thanks.
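To make the n_splits idea concrete, here is a minimal sketch that uses the test folds of StratifiedKFold as the groups. The toy data, the n_groups variable, and the group column are assumptions; it is not a claim that this handles the tiny strata or the decile balance, which is part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 3 strata with 10 rows each.
df = pd.DataFrame(
    {"region_popgroup": np.repeat(["A_0-50k", "B_0-50k", "C_100k-1M"], 10)}
)

n_groups = 5  # switch to 6 to compare
skf = StratifiedKFold(n_splits=n_groups, shuffle=True, random_state=42)

# Every row lands in exactly one test fold, so the fold index serves as the group label.
groups = np.empty(len(df), dtype=int)
# X is only used for its length here, so a dummy array of zeros is enough.
for fold, (_, test_idx) in enumerate(skf.split(np.zeros(len(df)), df["region_popgroup"])):
    groups[test_idx] = fold
df["group"] = groups

print(pd.crosstab(df["region_popgroup"], df["group"]))
```

Changing n_groups is then a one-line change, rather than recomputing a chain of test_size values.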