1

I have a data set in which 50% of the instances belong to class A and 50% to class B. I want to split it into a training set and a test set. I know the RemovePercentage filter exists, but it ignores the class balance. How do I remove 35% of my data set while still keeping a 50/50 class distribution in the training set?

Has QUIT--Anony-Mousse
Stanko

2 Answers

1

Take a look at the StratifiedRemoveFolds filter. It strives to maintain the original class distribution in each fold. http://weka.sourceforge.net/doc.stable/weka/filters/supervised/instance/StratifiedRemoveFolds.html

Walter
  • It works to generate a test set using StratifiedRemoveFolds, but those instances aren't removed from the whole data set, so my training set still has the instances from the test set. – Stanko May 11 '17 at 14:28
1

Ok, I've found a way using the filter StratifiedRemoveFolds:

Step 1

Open your data set in the Weka Explorer and choose the supervised instance filter StratifiedRemoveFolds.

Step 2

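Decide the sizes you want for your training and test sets. The filter outputs a single fold, so the filtered set contains 1/numFolds of the original data. If you want both sets to have equal size, set numFolds to 2 and apply the filter. This generates a data set that contains 50% of the data from the original set. (If you want 67% training data and 33% test data, set numFolds to 3. In that case the filtered set is the 33% part, so it is the test set, and the inverted selection is the training set.)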

Step 3

Save this generated set as, for example, "train.arff". Once the first set is saved, click Undo so that you are back at your full data set.

Step 4

Click on the StratifiedRemoveFolds filter and change the parameter invertSelection from False to True. When you apply the filter now, it generates the complement of the set from step 2: all instances that were not selected before (the other 50% when numFolds is 2).

Step 5

Save this as "test.arff". Now you have a train and test set that respect your class balance.
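
To make it easier to see why the two saved files do not overlap and why both keep the class balance, here is a minimal Python sketch of the idea behind a stratified fold split. It is not Weka's implementation (Weka can also shuffle the data with a seed); the function names `stratified_folds` and `split` and the toy labels are hypothetical and only serve the illustration. The `fold` argument is 1-based, mirroring the filter's fold option.

```python
from collections import defaultdict


def stratified_folds(labels, num_folds):
    """Assign every instance index to a fold, dealing class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(num_folds)]
    position = 0  # keep counting across classes so fold sizes stay even
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % num_folds].append(idx)
            position += 1
    return folds


def split(labels, num_folds, fold=1):
    """Return (train, test) index lists; the chosen fold becomes the test set."""
    folds = stratified_folds(labels, num_folds)
    # The selected fold corresponds to applying the filter as is.
    test = sorted(folds[fold - 1])
    # Everything else corresponds to applying it with the selection inverted.
    train = sorted(
        idx
        for k, members in enumerate(folds)
        if k != fold - 1
        for idx in members
    )
    return train, test


# Toy data set (assumption): 30 instances of class A, 30 of class B.
labels = ["A"] * 30 + ["B"] * 30
train, test = split(labels, num_folds=3)

print(len(train), len(test))
print(sum(labels[i] == "A" for i in train), sum(labels[i] == "B" for i in train))
```

With `num_folds=3` the selected fold holds one third of the instances and the inverted selection holds the remaining two thirds, each with as many A instances as B instances. A 35% test set cannot be obtained exactly this way, because the selected fold is always 1/numFolds of the data; numFolds 3 (about 33%) is the closest.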

Stanko
  • One thing to note: if **numFolds** is set to n (where n != 2), the first filtered data set contains only 1/n of the original, so it should be saved as "test.arff" rather than "train.arff". – Fanchen Bao Oct 19 '21 at 04:16