Let's say I have a dataframe that looks something like this. The table below is only an example; my real data has about 120,000 questions:

Question | Hint  | Cluster Label
q1       | q1_h1 | 1
q1       | q1_h2 | 1
q1       | q1_h3 | 1
q2       | q2_h1 | 2
q2       | q2_h2 | 2
q3       | q3_h1 | 1
q4       | q4_h1 | 2
q4       | q4_h2 | 2

I want to group by question and split the dataframe into train and test sets so that each question stays together with all of its hints, with the split stratified on the cluster label. The output I need would be:

train:
Question | Hint  | Cluster Label
q1       | q1_h1 | 1
q1       | q1_h2 | 1
q1       | q1_h3 | 1
q2       | q2_h1 | 2
q2       | q2_h2 | 2

test:
Question | Hint  | Cluster Label
q3       | q3_h1 | 1
q4       | q4_h1 | 2
q4       | q4_h2 | 2

HelloWorld

2 Answers

You can simply split the DataFrame according to the value in Hint:

df_train = df[df['Hint'].str.contains('q1') | df['Hint'].str.contains('q2')]

and similarly for df_test

user19077881
  • Sorry for not mentioning, but the above table is an example, I have like 120000 questions – HelloWorld Feb 05 '23 at 22:32
  • Then you need to explain how you want the split to take place, i.e. define the split condition(s), and maybe provide an example split that adequately represents what you want. – user19077881 Feb 05 '23 at 23:53

Looks like you need to use GroupKFold or StratifiedGroupKFold.

From the user manual, GroupKFold "is a variation of k-fold which ensures that the same group is not represented in both testing and training sets."

To use it, you call the constructor as normal:

from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)

and when you call the split method of gkf you specify the variable to group on (in your case 'Question').

If you're using it in GridSearchCV or similar, you pass the splitter as cv and supply the grouping variable through the 'groups' argument when you call fit on the GridSearchCV object. See previous answer here.
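A rough sketch of the GridSearchCV case, assuming made-up numeric data: the feature matrix X, the labels y, the group ids, the LogisticRegression estimator and the parameter grid are all hypothetical stand-ins, since the question does not describe the model or features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# Hypothetical toy data: 12 samples, 3 features, 6 groups of 2 samples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.array([0, 1] * 6)             # every group contains both classes
groups = np.repeat(np.arange(6), 2)  # stands in for the 'Question' column

search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 1.0]},    # illustrative grid only
    cv=GroupKFold(n_splits=3),       # the group-aware splitter goes in cv
)

# The groups are handed over at fit time and forwarded to the splitter.
search.fit(X, y, groups=groups)

print(search.best_params_)
```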

njp