Let's say I have a dataframe that looks something like this:
The following table is an example, I have like 120000 questions
Question | Hint | Cluster Label|
q1 |q1_h1 |1
q1 |q1_h2 |1
q1 |q1_h3 |1
q2 |q2_h1 |2
q2 |q2_h2 |2
q3 |q3_h1 |1
q4 |q4_h1 |2
q4 |q4_h2 |2
I want to groupby question and split dataframe into train and test such that associated question and hints are captured together and stratified on label. So output that I require would be:
train:
Question | Hint | Cluster Label|
q1 |q1_h1 |1
q1 |q1_h2 |1
q1 |q1_h3 |1
q2 |q2_h1 |2
q2 |q2_h2 |2
test:
Question | Hint | Cluster Label|
q3 |q3_h1 |1
q4 |q4_h1 |2
q4 |q4_h2 |2