Train Test Split sklearn based on group variable

Question

My X is as follows:

Unique ID    Exp start date    Value    Status
001          01/01/2020        4000     Closed
001          12/01/2019        4000     Archived
002          01/01/2020        5000     Closed
002          12/01/2019        5000     Archived

I want to make sure that none of the unique IDs that appear in the training set also appear in the test set. I am using sklearn's train_test_split. Is this possible?

score 3 · Answer 1 · answered May 15 '20 at 19:22


I believe you need GroupShuffleSplit (see its page in the scikit-learn documentation).

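import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Dummy features and labels for 8 samples
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
# One group label per sample; samples sharing a label are never split apart
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])

# Two random splits; train_size is a proportion of the groups, not of the samples
gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)

for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)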

TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]

As the output shows, the train/test indices follow the groups variable: each group ends up entirely in either the training set or the test set, never in both.

In your case, the Unique ID column should be used as groups.
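To make that concrete, here is a minimal sketch on a DataFrame shaped like the one in the question. The column names (`unique_id`, `exp_start_date`, `value`, `status`) and the extra IDs 003 and 004 are my own assumptions for illustration, not taken from your data. Using `n_splits=1` gives a single partition, which is the closest equivalent to a plain `train_test_split` call.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data mirroring the question; IDs 003 and 004 and all column names
# are made up for illustration.
df = pd.DataFrame({
    "unique_id": ["001", "001", "002", "002", "003", "003", "004", "004"],
    "exp_start_date": ["01/01/2020", "12/01/2019"] * 4,
    "value": [4000, 4000, 5000, 5000, 6000, 6000, 7000, 7000],
    "status": ["Closed", "Archived"] * 4,
})

# n_splits=1 yields a single train/test partition.
# test_size is a proportion of the groups (here 1 of the 4 IDs).
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["unique_id"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Sanity check: no ID should appear on both sides of the split.
shared_ids = set(train_df["unique_id"]) & set(test_df["unique_id"])
print("Shared IDs:", shared_ids)
```

If `shared_ids` comes back empty, the split respects the IDs. Note that because the sizes are counted in groups, the row-level train/test ratio can drift when IDs have very different numbers of rows.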


seralouk


Is there a way to make sure this split is also stratified? – user42 Jul 02 '21 at 12:45
Using `GroupShuffleSplit`? No, you would need to code that yourself. – seralouk Jul 02 '21 at 13:58

I just found sklearn's implementation of [StratifiedGroupKFold](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.StratifiedGroupKFold.html), but I can't get it running on my system because of an ImportError. Could you help me with that [here](https://stackoverflow.com/q/68226119/14022582)? – user42 Jul 02 '21 at 14:02
