I have some sample data as follows:
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
                   "id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                   "label": ["a", "a", "a", "b", "a", "b", "a", "a", "b", "a", "b", "a"]})
So my data looks like this:
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
I would like to split this data into two groups (train, test) so that both have the same label distribution, given the number of test samples (e.g. 6 samples). In my setting, I prefer to define the size of the test set as an integer number of samples rather than as a percentage. However, in my domain, every id MUST be allocated to ONLY one group. For example, if id 1 is assigned to the training set, no other sample with id 1 can be assigned to the test set. So the expected output is 2 dataframes as follows:
Training set
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
Test set
x id label
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
Both the training set and the test set have the same class distribution (a:b is 4:2), and ids 1 and 2 were assigned only to the training set while ids 3, 4 and 5 were assigned only to the test set. I normally use sklearn's train_test_split for splitting, but I could not figure out how to apply it under this condition. Could you suggest how to handle it?
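To make the requirements concrete, here is a minimal sketch of the check I would want any candidate split to pass. The function name `check_split` and its arguments are hypothetical, made up for illustration. Requiring exactly equal label proportions is an assumption that happens to hold for this toy example; on real data a tolerance would probably be needed.

```python
from collections import Counter


def check_split(train_ids, train_labels, test_ids, test_labels, n_test):
    """Return True if a candidate split meets all three requirements."""
    # 1. The test set contains exactly the requested number of samples.
    right_size = len(test_ids) == n_test

    # 2. No id appears in both the training set and the test set.
    disjoint = set(train_ids).isdisjoint(test_ids)

    # 3. Both sets have the same label proportions.
    #    Cross-multiplying the counts avoids floating-point comparisons.
    train_counts, test_counts = Counter(train_labels), Counter(test_labels)
    same_dist = all(
        train_counts[lab] * len(test_labels) == test_counts[lab] * len(train_labels)
        for lab in set(train_counts) | set(test_counts)
    )
    return right_size and disjoint and same_dist


# The expected split from the tables above, written out as plain lists.
train_ids = [1, 1, 1, 1, 2, 2]
train_labels = ["a", "a", "a", "b", "a", "b"]
test_ids = [3, 3, 4, 4, 5, 5]
test_labels = ["a", "a", "b", "a", "b", "a"]

print(check_split(train_ids, train_labels, test_ids, test_labels, n_test=6))
```

With dataframes, the same check could be called on the columns, e.g. `check_split(train["id"], train["label"], test["id"], test["label"], 6)`, where `train` and `test` are the two output dataframes.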