2

I want to split my dataset into two parts: 75% for training and 25% for testing. There are two classes. I also have another dataset in which only one instance belongs to one class and all the other instances belong to the second class, so I don't want to split randomly. If a class has only one instance, I want to make sure that instance ends up in the training set. Any ideas on how to do this? I know I have to select the indices, but I don't know how. Right now I am doing this, which selects the first 75% as training and the remainder as testing:

split = int(len(df) * .75)  # row position where the test set starts
train_data = df[:split]
test_data = df[split:]
Ara

4 Answers

2

This could help: GroupKFold. See the scikit-learn documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html

NicolasWoloszko
1

You are looking for a stratified train/test split; see sklearn.model_selection.StratifiedKFold.
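To make the idea concrete, here is a minimal hand-rolled sketch of a stratified 75/25 split that selects row indices and always keeps at least one row of every class in training. It is not scikit-learn's implementation, and the function name `stratified_split_indices`, its parameters, and the example labels are all hypothetical. As far as I know, scikit-learn's own `train_test_split(..., stratify=y)` refuses to split when a class has only one member, which is why the single-instance case is handled by hand here.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Every class contributes at least one row to the training set, so a
    class with a single instance always ends up in training.
    """
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    train_idx, test_idx = [], []
    for positions in by_class.values():
        rng.shuffle(positions)
        # Round to the nearest row count, but never below one training row.
        n_train = max(1, round(len(positions) * train_fraction))
        train_idx.extend(positions[:n_train])
        test_idx.extend(positions[n_train:])

    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 99 rows of class 0 and a single row of class 1.
labels = [0] * 99 + [1]
train_idx, test_idx = stratified_split_indices(labels)
```

With a pandas DataFrame, the returned positions could then be used as `df.iloc[train_idx]` and `df.iloc[test_idx]`.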

Martin Thoma
0

Does your dataset change in size, or will it always contain the same amount of data? If the latter, you could simply use the index that marks 75% of your total set as the stop argument of the slice. For instance, if you have 100 items, you'd assign train_data = df[:75] and test_data = df[75:].

But without a model or shortened script, I don't think I can do much more.

0

Try this:

train_data = df[:int(len(df) * .75)]
test_data = df[int(len(df) * .75)::int(len(df) * .25) - 1]

It worked for me when tested against a list of 10 integers.
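A quick check of how this slice behaves at other sizes may be useful. The sketch below uses plain Python lists and two hypothetical helper functions; it relies only on the fact that the third slice argument is a step, so the test set is complete only when that step happens to be 1.

```python
def split_with_step(items):
    # The slice from this answer: start at 75%, no stop, computed step.
    start = int(len(items) * .75)
    step = int(len(items) * .25) - 1
    return items[:start], items[start::step]


def split_plain(items):
    # Same cut point, but no step: the test set is everything after the cut.
    cut = int(len(items) * .75)
    return items[:cut], items[cut:]


ten = list(range(10))
hundred = list(range(100))

print(split_with_step(ten)[1])        # [7, 8, 9]
print(split_with_step(hundred)[1])    # [75, 99]
print(len(split_plain(hundred)[1]))   # 25
```

For 10 items the step is 1, so both versions agree; for 100 items the step is 24 and most of the last quarter is skipped, while the plain slice keeps all 25 rows.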

  • Can you please explain this line test_data = df[int(len(df) * .75)::int(len(df) * .25) - 1] – Ara Mar 30 '18 at 15:30
  • Sure. It helps to read this slice one argument at a time. The first argument is the start index, 75% of the way through the list. There is no second (stop) argument, so the slice runs to the end of the list, which is why it's left blank. The third and last argument is the step, int(len(df) * .25) - 1, which is 1 for a list of 10 items, so every remaining item is read. –  Mar 30 '18 at 15:50
  • Also, if this answer works for you, then please select it as such and close the question. –  Mar 30 '18 at 20:34