2

I have a data set of animal types with IDs, and I want to split it into Train and Test data sets. I also want to keep all rows for a given animal ID within either the Train or the Test data set, never both. An example of the data is below, with a random Train/Test split ratio of 80/20.

Animal  ID  Test/Train
CAT     1   TRAIN
CAT     1   TRAIN
CAT     2   TRAIN
CAT     2   TRAIN
CAT     3   TRAIN
CAT     3   TEST
CAT     4   TRAIN
CAT     4   TRAIN
CAT     5   TEST
CAT     5   TRAIN
DOG     1   TRAIN
DOG     1   TRAIN
DOG     2   TRAIN
DOG     2   TRAIN
DOG     3   TRAIN
DOG     3   TRAIN
DOG     4   TEST
DOG     4   TEST
DOG     5   TRAIN
DOG     5   TRAIN

Note how CAT with ID 3 and CAT with ID 5 each exist in both the Train and Test data sets. Is there a function in scikit-learn, or an option of train_test_split, that keeps all rows sharing a value in a column within the same data set while maintaining the test ratio? For example, if one CAT record with ID 3 is flagged as Train data, then every other record with CAT and ID 3 should also be flagged as Train data.

AlmostThere

2 Answers

1

I found a solution to your request, scikit-learn's GroupShuffleSplit:

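from sklearn.model_selection import GroupShuffleSplit

# Split on the groups in df['ID'] so rows sharing an ID stay together.
splitter = GroupShuffleSplit(test_size=0.2, n_splits=2, random_state=7)
split = splitter.split(df, groups=df['ID'])
# Take the first of the generated splits.
train_inds, test_inds = next(split)

# The returned indices are positional, so select rows with iloc.
train = df.iloc[train_inds]
test = df.iloc[test_inds]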
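
A minimal, self-contained sketch of the same approach, rebuilt on the question's example data. One assumption goes beyond the answer as posted: because the same ID values recur under both CAT and DOG, the sketch groups on a combined Animal/ID key (the column name group_key is made up for illustration) so that CAT 1 and DOG 1 count as separate groups. Note also that test_size applies to the number of groups, so the row-level ratio is only approximately 80/20 when group sizes differ.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Rebuild the question's example: 2 animals x 5 IDs x 2 rows each = 20 rows.
rows = [
    (animal, id_)
    for animal in ("CAT", "DOG")
    for id_ in range(1, 6)
    for _ in range(2)
]
df = pd.DataFrame(rows, columns=["Animal", "ID"])

# Hypothetical helper column (an assumption, not in the original answer):
# one label per (Animal, ID) pair, e.g. "CAT_3".
df["group_key"] = df["Animal"] + "_" + df["ID"].astype(str)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_inds, test_inds = next(splitter.split(df, groups=df["group_key"]))

train = df.iloc[train_inds]
test = df.iloc[test_inds]

# No (Animal, ID) pair should appear on both sides of the split.
overlap = set(train["group_key"]) & set(test["group_key"])
print(f"train rows: {len(train)}, test rows: {len(test)}, shared groups: {len(overlap)}")
```

With ten equally sized groups, two groups go to the test set, so the split here is 16 training rows and 4 test rows with no shared groups.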

0

Did you set the stratify parameter? If so, remove it and check again.

  • Hi Aditya! Thanks for mentioning that. My understanding is that stratify maintains the ratio of the field you are trying to predict between the Train and Test data sets (i.e., if the data is split 75/25 on a binary field, both the train and test data sets keep that ratio). In my example, I want to make sure all IDs for a respective animal exist in either the Train or the Test data set, not in both, irrespective of whichever field is being predicted. – AlmostThere Oct 02 '20 at 12:31
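
To make the distinction in this comment concrete, here is a small sketch on made-up data (the column names target and group are hypothetical). It suggests that the stratify argument of train_test_split preserves the class ratio of the target, while nothing in that call tells it which rows belong together, so rows with the same ID are free to end up on both sides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 40 rows, a 75/25 binary target, and 20 groups of 2 rows each.
df = pd.DataFrame({
    "target": [0] * 30 + [1] * 10,
    "group": [i // 2 for i in range(40)],
})

train, test = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=0
)

# Stratification keeps the share of positive targets equal in both parts.
print(train["target"].mean(), test["target"].mean())

# The split knows nothing about groups, so a group may land in both parts.
shared = set(train["group"]) & set(test["group"])
print(f"groups present in both train and test: {len(shared)}")
```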