Complex dataset split - StratifiedGroupShuffleSplit

Question

I have a dataset of ~2m observations which I need to split into training, validation and test sets in the ratio 60:20:20. A simplified excerpt of my dataset looks like this:

+---------+------------+-----------+-----------+
| note_id | subject_id | category  |   note    |
+---------+------------+-----------+-----------+
|       1 |          1 | ECG       | blah ...  |
|       2 |          1 | Discharge | blah ...  |
|       3 |          1 | Nursing   | blah ...  |
|       4 |          2 | Nursing   | blah ...  |
|       5 |          2 | Nursing   | blah ...  |
|       6 |          3 | ECG       | blah ...  |
+---------+------------+-----------+-----------+

There are multiple categories - which are not evenly balanced - so I need to ensure that the training, validation and test sets all have the same proportions of categories as in the original dataset. This part is fine, I can just use StratifiedShuffleSplit from the sklearn library.

However, I also need to ensure that the observations from each subject are not split across the training, validation and test datasets. All the observations from a given subject need to be in the same bucket to ensure my trained model has never seen the subject before when it comes to validation/testing. E.g. every observation of subject_id 1 should be in the training set.

I can't think of a way to ensure a stratified split by category, prevent contamination (for want of a better word) of subject_id across datasets, ensure a 60:20:20 split and ensure that the dataset is somehow shuffled. Any help would be appreciated!

Thanks!

EDIT:

I've now learnt that grouping by a category and keeping groups together across dataset splits can also be accomplished by sklearn through the GroupShuffleSplit function. So essentially, what I need is a combined stratified and grouped shuffle split i.e. StratifiedGroupShuffleSplit which does not exist. Github issue: https://github.com/scikit-learn/scikit-learn/issues/12076

score 6 · Answer 1 · answered Oct 14 '21 at 18:12

This is solved in scikit-learn 1.0 with StratifiedGroupKFold.

This example generates 3 folds after shuffling, keeping groups together and stratifying as far as possible:

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

X = np.ones((30, 2))
y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
              1, 1, 1, 0, 0, 0, 0, 1, 1, 1,])
groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5,
                   5, 5, 6, 6, 7, 8, 8, 9, 9, 9,
                   10, 11, 11, 12, 12, 12, 13, 13,
                   13, 13])
print("ORIGINAL POSITIVE RATIO:", y.mean())
cv = StratifiedGroupKFold(n_splits=3, shuffle=True)
for fold, (train_idxs, test_idxs) in enumerate(cv.split(X, y, groups)):
    print("Fold :", fold)
    print("TRAIN POSITIVE RATIO:", y[train_idxs].mean())
    print("TEST POSITIVE RATIO :", y[test_idxs].mean())
    print("TRAIN GROUPS        :", set(groups[train_idxs]))
    print("TEST GROUPS         :", set(groups[test_idxs]))

In the output you can see that the ratio of positive cases in each fold stays close to the original positive ratio and that the same group never appears in both sets. Of course, the fewer and bigger your groups are (or the more imbalanced your classes are), the harder it is to stay close to the original class distribution.

Output:

ORIGINAL POSITIVE RATIO: 0.5
Fold : 0
TRAIN POSITIVE RATIO: 0.4375
TEST POSITIVE RATIO : 0.5714285714285714
TRAIN GROUPS        : {1, 3, 4, 5, 6, 7, 10, 11}
TEST GROUPS         : {2, 8, 9, 12, 13}
Fold : 1
TRAIN POSITIVE RATIO: 0.5
TEST POSITIVE RATIO : 0.5
TRAIN GROUPS        : {2, 4, 5, 7, 8, 9, 11, 12, 13}
TEST GROUPS         : {1, 10, 3, 6}
Fold : 2
TRAIN POSITIVE RATIO: 0.5454545454545454
TEST POSITIVE RATIO : 0.375
TRAIN GROUPS        : {1, 2, 3, 6, 8, 9, 10, 12, 13}
TEST GROUPS         : {11, 4, 5, 7}

This is very close to what I needed but is only for Cross Validation (i.e. each split is the same size rather than configurable to e.g. 60:20:20) — amin_nejad, Mar 28 '22 at 21:40
Yes, this is for cross-validation. Still you can get one single split with your desired ratio by abusing it a little bit: Set `n_splits=int(1/desired_test_ratio)`. After that you can randomly choose any of the created folds. If you want also a training-validation split, you can repeat the process within the training split with the corresponding `desired_val_ratio`. — Juan Manuel Ortiz, Apr 20 '22 at 16:41

score 5 · Answer 2 · answered Sep 02 '20 at 13:13

This question is more than a year old, but I found myself in a similar situation: I have labels and groups, and due to the nature of the groups, all data points of one group must be either only in the test set or only in the training set. I wrote a small algorithm using pandas and sklearn; I hope it helps.

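from sklearn.model_selection import GroupShuffleSplit

# df is assumed to have a 'label' column (class) and a 'groups' column (group id);
# valid_size is the fraction of groups to put in the test set and must be defined beforehand
groups = df.groupby('label')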
all_train = []
all_test = []
for group_id, group in groups:
    # if a group is already taken in test or train it must stay there
    group = group[~group['groups'].isin(all_train+all_test)]
    # if group is empty 
    if group.shape[0] == 0:
        continue
    train_inds, test_inds = next(GroupShuffleSplit(
        test_size=valid_size, n_splits=2, random_state=7).split(group, groups=group['groups']))

    all_train += group.iloc[train_inds]['groups'].tolist()
    all_test += group.iloc[test_inds]['groups'].tolist()



train= df[df['groups'].isin(all_train)]
test= df[df['groups'].isin(all_test)]

form_train = set(train['groups'].tolist())
form_test = set(test['groups'].tolist())
inter = form_train.intersection(form_test)

print(df.groupby('label').count())
print(train.groupby('label').count())
print(test.groupby('label').count())
print(inter) # this should be empty

amin_nejad · Accepted Answer · 2019-07-04T16:58:49.420

Essentially I need StratifiedGroupShuffleSplit, which does not exist (Github issue). This is because the behaviour of such a function is unclear, and producing a split that is both grouped and stratified is not always possible (also discussed here), especially with a heavily imbalanced dataset such as mine. In my case, I want grouping to be strict, so that there is no overlap of groups whatsoever, while stratification and the 60:20:20 ratio only need to hold approximately, i.e. as well as is possible.

As Ghanem mentions, I have no choice but to build a function to split the dataset myself, which I have done below:

import numpy as np
import pandas as pd


def StratifiedGroupShuffleSplit(df_main):

    df_main = df_main.reindex(np.random.permutation(df_main.index)) # shuffle dataset

    # create empty train, val and test datasets
    df_train = pd.DataFrame()
    df_val = pd.DataFrame()
    df_test = pd.DataFrame()

    hparam_mse_wgt = 0.1 # must be between 0 and 1
    assert(0 <= hparam_mse_wgt <= 1)
    train_proportion = 0.6 # must be between 0 and 1
    assert(0 <= train_proportion <= 1)
    val_test_proportion = (1-train_proportion)/2

    subject_grouped_df_main = df_main.groupby(['subject_id'], sort=False, as_index=False)
    category_grouped_df_main = df_main.groupby('category').count()[['subject_id']]/len(df_main)*100

    def calc_mse_loss(df):
        grouped_df = df.groupby('category').count()[['subject_id']]/len(df)*100
        df_temp = category_grouped_df_main.join(grouped_df, on = 'category', how = 'left', lsuffix = '_main')
        df_temp.fillna(0, inplace=True)
        df_temp['diff'] = (df_temp['subject_id_main'] - df_temp['subject_id'])**2
        mse_loss = np.mean(df_temp['diff'])
        return mse_loss

    i = 0
    for _, group in subject_grouped_df_main:

        # seed each dataset with one group so that none of them starts empty
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        if (i < 3):
            if (i == 0):
                df_train = pd.concat([df_train, group], ignore_index=True)
                i += 1
                continue
            elif (i == 1):
                df_val = pd.concat([df_val, group], ignore_index=True)
                i += 1
                continue
            else:
                df_test = pd.concat([df_test, group], ignore_index=True)
                i += 1
                continue

        # improvement in category balance if this group were added to each dataset
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        mse_loss_diff_train = calc_mse_loss(df_train) - calc_mse_loss(pd.concat([df_train, group], ignore_index=True))
        mse_loss_diff_val = calc_mse_loss(df_val) - calc_mse_loss(pd.concat([df_val, group], ignore_index=True))
        mse_loss_diff_test = calc_mse_loss(df_test) - calc_mse_loss(pd.concat([df_test, group], ignore_index=True))

        total_records = len(df_train) + len(df_val) + len(df_test)

        len_diff_train = (train_proportion - (len(df_train)/total_records))
        len_diff_val = (val_test_proportion - (len(df_val)/total_records))
        len_diff_test = (val_test_proportion - (len(df_test)/total_records)) 

        len_loss_diff_train = len_diff_train * abs(len_diff_train)
        len_loss_diff_val = len_diff_val * abs(len_diff_val)
        len_loss_diff_test = len_diff_test * abs(len_diff_test)

        loss_train = (hparam_mse_wgt * mse_loss_diff_train) + ((1-hparam_mse_wgt) * len_loss_diff_train)
        loss_val = (hparam_mse_wgt * mse_loss_diff_val) + ((1-hparam_mse_wgt) * len_loss_diff_val)
        loss_test = (hparam_mse_wgt * mse_loss_diff_test) + ((1-hparam_mse_wgt) * len_loss_diff_test)

        # assign the group to the dataset with the highest score
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        if (max(loss_train,loss_val,loss_test) == loss_train):
            df_train = pd.concat([df_train, group], ignore_index=True)
        elif (max(loss_train,loss_val,loss_test) == loss_val):
            df_val = pd.concat([df_val, group], ignore_index=True)
        else:
            df_test = pd.concat([df_test, group], ignore_index=True)

        print ("Group " + str(i) + ". loss_train: " + str(loss_train) + " | " + "loss_val: " + str(loss_val) + " | " + "loss_test: " + str(loss_test) + " | ")
        i += 1

    return df_train, df_val, df_test

df_train, df_val, df_test = StratifiedGroupShuffleSplit(df_main)

I have created some arbitrary loss function based on 2 things:

The average squared difference in the percentage representation of each category compared to the overall dataset
The signed squared difference between each dataset's share of the records and the share it should have according to the ratio supplied (60:20:20)

Weighting these two inputs to the loss function is done by the static hyperparameter hparam_mse_wgt. For my particular dataset, a value of 0.1 worked well but I would encourage you to play around with it if you use this function. Setting it to 0 will prioritise only maintaining the split ratio and ignore the stratification. Setting it to 1 would be vice versa.

Using this loss function, I then iterate through each subject (group) and append it to whichever dataset (training, validation or test) has the highest loss value for that group.

It's not particularly complicated but it does the job for me. It won't necessarily work for every dataset, but the larger it is, the better the chance. Hopefully someone else will find it useful.
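
Since both stratification and the 60:20:20 ratio are only approximate here, it is worth checking the result. Below is a minimal sketch of such a check; the helper name `summarize_split` and the toy data are made up for illustration, and the column names `subject_id` and `category` are taken from the question.

```python
import pandas as pd


def summarize_split(df_main, parts, group_col="subject_id", cat_col="category"):
    """Return size shares, per-split category percentages and group overlaps (hypothetical helper)."""
    # share of all records that ended up in each split
    shares = {name: len(part) / len(df_main) for name, part in parts.items()}

    # category percentages, one column per split plus one for the overall dataset
    frames = {"overall": df_main, **parts}
    mix = pd.DataFrame(
        {name: frame[cat_col].value_counts(normalize=True) * 100
         for name, frame in frames.items()}
    ).fillna(0.0)

    # groups shared by any two splits; every set should be empty
    names = list(parts)
    overlaps = {
        (a, b): set(parts[a][group_col]) & set(parts[b][group_col])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    return shares, mix, overlaps


# toy data shaped like the excerpt in the question
df_main = pd.DataFrame({
    "note_id": [1, 2, 3, 4, 5, 6],
    "subject_id": [1, 1, 1, 2, 2, 3],
    "category": ["ECG", "Discharge", "Nursing", "Nursing", "Nursing", "ECG"],
})
parts = {
    "train": df_main[df_main["subject_id"] == 1],
    "val": df_main[df_main["subject_id"] == 2],
    "test": df_main[df_main["subject_id"] == 3],
}
shares, mix, overlaps = summarize_split(df_main, parts)
print(shares)
print(mix.round(1))
print(overlaps)
```

With the real data you would pass `{"train": df_train, "val": df_val, "test": df_test}` as `parts` and compare each column of `mix` against the `overall` column.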

score 2 · Answer 4 · answered Nov 03 '20 at 13:26

I just had to solve the same problem. In my document processing use case I wanted words from the same page to stick together (group), while document categories should be split across the train and test set evenly (stratify). For my problem it holds that for all instances of one group we have the same stratification category, i.e. all words from one page belong to the same category. Therefore, I found it easiest to perform the stratified split on the groups directly and then to use the split groups to select the instances. Where this assumption does not hold, this solution is not applicable though.

from typing import Tuple

import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_group_train_test_split(
    samples: pd.DataFrame, group: str, stratify_by: str, test_size: float
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    groups = samples[group].drop_duplicates()
    stratify = samples.drop_duplicates(group)[stratify_by].to_numpy()
    groups_train, groups_test = train_test_split(groups, stratify=stratify, test_size=test_size)

    samples_train = samples.loc[lambda d: d[group].isin(groups_train)]
    samples_test = samples.loc[lambda d: d[group].isin(groups_test)]

    return samples_train, samples_test
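
Because the function above only works when every group has a single stratification category, it can help to check that assumption first. A minimal sketch; the helper name `groups_have_single_category` and the example frames are made up for illustration.

```python
import pandas as pd


def groups_have_single_category(samples: pd.DataFrame, group: str, stratify_by: str) -> bool:
    """True if no group contains more than one distinct stratification category."""
    return bool((samples.groupby(group)[stratify_by].nunique() <= 1).all())


# words grouped by page, every page has exactly one document category
pages = pd.DataFrame({
    "page": [1, 1, 2, 2, 3],
    "doc_category": ["invoice", "invoice", "letter", "letter", "invoice"],
})
print(groups_have_single_category(pages, "page", "doc_category"))

# the excerpt from the question: subject 1 has notes in three categories
notes = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 3],
    "category": ["ECG", "Discharge", "Nursing", "Nursing", "Nursing", "ECG"],
})
print(groups_have_single_category(notes, "subject_id", "category"))
```

The second frame fails the check, so for data like that in the question this group-level stratification cannot be applied directly.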

score 0 · Answer 5 · answered Jul 03 '19 at 15:04

I think in this case you have to build your own function to split the data. This is an implementation by me:

import numpy as np
import pandas as pd


def split(df, based_on='subject_id', cv=5):
    splits = []
    # split the unique group ids (not the rows) into cv chunks
    based_on_uniq = df[based_on].unique()
    based_on_uniq = np.array_split(based_on_uniq, cv)
    for fold in based_on_uniq:
        # keep every row whose group id falls in this chunk
        splits.append(df[df[based_on].isin(fold)])
    return splits


if __name__ == '__main__':
    df = pd.DataFrame([{'note_id': 1, 'subject_id': 1, 'category': 'test1', 'note': 'test1'},
                       {'note_id': 2, 'subject_id': 1, 'category': 'test2', 'note': 'test2'},
                       {'note_id': 3, 'subject_id': 2, 'category': 'test3', 'note': 'test3'},
                       {'note_id': 4, 'subject_id': 2, 'category': 'test4', 'note': 'test4'},
                       {'note_id': 5, 'subject_id': 3, 'category': 'test5', 'note': 'test5'},
                       {'note_id': 6, 'subject_id': 3, 'category': 'test6', 'note': 'test6'},
                       {'note_id': 7, 'subject_id': 4, 'category': 'test7', 'note': 'test7'},
                       {'note_id': 8, 'subject_id': 4, 'category': 'test8', 'note': 'test8'},
                       {'note_id': 9, 'subject_id': 5, 'category': 'test9', 'note': 'test9'},
                       {'note_id': 10, 'subject_id': 5, 'category': 'test10', 'note': 'test10'},
                       ])
    print(split(df))

I think you're probably right. Thanks for the starter code for returning split points. I guess the question is how much of it can be made easier by using library-provided functions. I'll see what I can come up with — amin_nejad, Jul 03 '19 at 15:51

score 0 · Answer 6 · answered Jul 09 '21 at 16:07

As others have commented before: StratifiedGroupShuffleSplit doesn't exist, because grouped splits cannot be guaranteed to have a similar number of instances of each class. However, there is a silly but painfully easy approach that will eventually give a good enough split:

Use GroupShuffleSplit with a random state set (e.g. GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)).
Calculate the balance between classes in each split.
If it's not to your satisfaction, just run again with random_state set to another value.
Continue until the split is good enough.

This method is obviously best for a low number of splits and a binary label.
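
The steps above could be sketched roughly as follows. This is a variant that tries a fixed number of seeds and keeps the best one instead of stopping at the first acceptable split; the function name `best_group_split`, the `n_tries` parameter, the "largest gap in class share" criterion and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def best_group_split(y, groups, test_size=0.3, n_tries=50):
    """Try several seeds; return (gap, seed, train_idx, test_idx) for the best-balanced split."""
    y, groups = np.asarray(y), np.asarray(groups)
    X = np.zeros((len(y), 1))  # placeholder, the splitter only uses X for its length
    classes = np.unique(y)
    overall = np.array([(y == c).mean() for c in classes])

    best = None
    for seed in range(n_tries):
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
        train, test = next(splitter.split(X, y, groups))
        # largest absolute difference in class share between the test set and the full data
        test_mix = np.array([(y[test] == c).mean() for c in classes])
        gap = np.abs(test_mix - overall).max()
        if best is None or gap < best[0]:
            best = (gap, seed, train, test)
    return best


# toy data (assumption): 500 samples, 40 groups, binary label
rng = np.random.default_rng(0)
groups = rng.integers(0, 40, size=500)
y = rng.integers(0, 2, size=500)

gap, seed, train, test = best_group_split(y, groups)
print("best seed:", seed, "largest class-share gap:", gap)
```

Note that `test_size` in GroupShuffleSplit is a proportion of groups, not of samples, so the sample ratio of the split is also only approximate.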

score 0 · Answer 7 · answered Jun 15 '23 at 11:38

In my case, I assumed that all samples in the same group have the same label, so I combined StratifiedShuffleSplit with GroupShuffleSplit like this:

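import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils import check_array


class StratifiedGroupShuffleSplit(StratifiedShuffleSplit):
    """
    Note: this assumes that all samples in the same group have the same label.
    """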
    def __init__(
        self, n_splits = 10, *, test_size = None, 
        train_size = None, random_state = None
    ):
        super().__init__(
            n_splits = n_splits,
            test_size = test_size,
            train_size = train_size,
            random_state = random_state,
        )
        self._default_test_size = 0.1

    def _iter_indices(self, X, y, groups = None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        groups = check_array(groups, input_name = "groups", ensure_2d = False, dtype = None)
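        # classes: unique group ids; first_indices: first sample of each group;
        # group_indices: for every sample, the position of its group in classes
        classes, first_indices, group_indices = np.unique(
            groups, return_index = True, return_inverse = True
        )
        # one label per group, taken from the group's first sample
        stratify = np.asarray(y)[first_indices]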

        for group_train, group_test in super()._iter_indices(X = classes, y = stratify):
            # these are the indices of classes in the partition
            # invert them into data indices

            # np.isin replaces np.in1d, which is deprecated in NumPy 2.0
            train = np.flatnonzero(np.isin(group_indices, group_train))
            test = np.flatnonzero(np.isin(group_indices, group_test))

            yield train, test
