
I have sample data as follows:

import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
                   "id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                   "label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"]})

So my data looks like this:

  x   id   label
 10   1    a
 20   1    a
 30   1    a
 40   1    b
 50   2    a
 60   2    b
 70   3    a
 80   3    a
 90   4    b
100   4    a
110   5    b
120   5    a

I would like to split this data into two groups (train, test) based on the label distribution, given the number of test samples (e.g. 6 samples). My setting prefers to define the size of the test set as an integer representing the number of test samples rather than as a percentage. However, in my specific domain, any id MUST be allocated to ONLY one group. For example, if id 1 was assigned to the training set, other samples with id 1 cannot be assigned to the test set. So the expected output is 2 dataframes as follows:

Training set

  x   id   label
 10   1    a
 20   1    a
 30   1    a
 40   1    b
 50   2    a
 60   2    b

Test set

  x   id   label
 70   3    a
 80   3    a
 90   4    b
100   4    a
110   5    b
120   5    a

Both the training set and the test set have the same class distribution (a:b is 4:2), and ids 1 and 2 were assigned only to the training set while ids 3, 4 and 5 were assigned only to the test set. I used to do this with sklearn's train_test_split but I could not figure out how to apply it with such a condition. May I have your suggestions on how to handle such a condition?

  • The only approach that comes to my mind is to split train/test over ids (split unique ids in 2 sets), but this might not yield the desired split percentage of data, which inherently you can't achieve in this problem from what I see. – Farhood ET Apr 21 '20 at 07:14
  • I like this method - https://stackoverflow.com/q/54797508/10276092 – M.Viking Jul 15 '22 at 01:18

2 Answers


sklearn.model_selection has several options other than train_test_split, and one of them aims at solving exactly what you're after. In this case you can use GroupShuffleSplit, which, as mentioned in the docs, provides randomized train/test indices to split data according to a third-party provided group. You also have GroupKFold for these cases, which is very useful.

from sklearn.model_selection import GroupShuffleSplit

X = df.drop(columns='label')  # features; keep id so it can be used for grouping
y = df.label                  # target

You can now instantiate GroupShuffleSplit and proceed as you would with train_test_split, with the only difference being that you specify a groups column, which will be used to split X and y so that the split respects the group values:

gs = GroupShuffleSplit(n_splits=2, test_size=.6, random_state=0)
train_ix, test_ix = next(gs.split(X, y, groups=X.id))  # take the first split only
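
A side note, per the scikit-learn docs: a float test_size is taken as a proportion of the groups, and it can also be an int, in which case it is the absolute number of test groups (ids), not test samples. A minimal sketch, assuming you wanted exactly 3 of the 5 ids in the test set (gs_int and the index names are illustrative):

# test_size as an int counts groups (ids), not rows
gs_int = GroupShuffleSplit(n_splits=1, test_size=3, random_state=0)
train_ix2, test_ix2 = next(gs_int.split(X, y, groups=X.id))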

Now you can index the dataframe to create the train and test sets:

X_train = X.iloc[train_ix]  # split() returns positional indices, so use iloc
y_train = y.iloc[train_ix]

X_test = X.iloc[test_ix]
y_test = y.iloc[test_ix]

Giving:

print(X_train)

      x  id
4    50   2
5    60   2
8    90   4
9   100   4
10  110   5
11  120   5

And for the test set:

print(X_test)

   x  id
0  10   1
1  20   1
2  30   1
3  40   1
6  70   3
7  80   3
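
As mentioned above, GroupKFold covers the same constraint when you want several folds. A minimal sketch of iterating over its splits (note that with more than one split you iterate over the splitter rather than unpacking a single tuple):

from sklearn.model_selection import GroupKFold

# every id stays intact: it appears in the test indices of exactly one fold
gkf = GroupKFold(n_splits=2)
for fold, (train_ix, test_ix) in enumerate(gkf.split(X, y, groups=X.id)):
    print(f"fold {fold}: test ids {sorted(X.id.iloc[test_ix].unique())}")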
yatu
  • Thank you very much. It works perfectly. I'm wondering ... what if i would like to split into 3 groups? When i set the input argument n_splits=3, ValueError: not enough values to unpack (expected 3, got 2) – Ratchainant Thammasudjarit Apr 21 '20 at 08:23
  • Yes that is in the case you want to iterate over the splits. See the example [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html#sklearn.model_selection.GroupShuffleSplit.get_n_splits) @RatchainantThammasudjarit Glad it helped :) – yatu Apr 21 '20 at 08:27
  • Is there a way to make sure this is a stratified split? – user42 Jul 02 '21 at 12:43

Adding to Yatu's brilliant answer, you can split your data using only pandas if you like, although it's better to use what was proposed in his answer.

import pandas as pd

df = pd.DataFrame(
    {
        "x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
        "id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"],
    }
)


TRAIN_TEST_SPLIT_PERC = 0.75
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows first, so unique() sees the ids in random order
uniques = df["id"].unique()                    # ids in order of first appearance after the shuffle
sep = int(len(uniques) * TRAIN_TEST_SPLIT_PERC)
train_ids, test_ids = uniques[:sep], uniques[sep:]
train_df, test_df = df[df.id.isin(train_ids)], df[df.id.isin(test_ids)]


print("\nTRAIN DATAFRAME\n", train_df)
print("\nTEST DATAFRAME\n", test_df)
Farhood ET