2

I have an imbalanced dataset. I am using SMOTE (Synthetic Minority Oversampling Technique) to perform oversampling. When performing binary classification, I use 10-fold cross-validation on this oversampled dataset.

However, I recently came across this paper, Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models, which mentions that it is incorrect to use the oversampled dataset during cross-validation as it leads to overoptimistic performance estimates.

I want to verify: what is the correct approach/procedure for using over-sampled data in cross-validation?

J Cena
  • You always want to perform all the analysis steps for each cross validation fold independently. In this case, over-sample each fold individually. The paper you linked to describes the proper way to do CV. – Gabe Mar 27 '18 at 02:29
  • I think this question should be on Cross Validated https://stats.stackexchange.com/ as it is less about implementation and more about the idea. – Shridhar R Kulkarni Mar 27 '18 at 04:29
  • @Gabe Do you mean that for the 10 folds, I have to perform oversampling separately? – J Cena Mar 27 '18 at 04:38
  • Yeah, you'd want to over-sample each fold by itself, using only the data from that fold. That way, you're essentially doing the classification (including over-sampling) on 10 "different" datasets, which is the point of doing CV to estimate performance. I don't use Weka myself, but it seems like nekomatic's answer explains how to implement it there. – Gabe Mar 27 '18 at 13:56

2 Answers

3

To avoid overoptimistic performance estimates from cross-validation in Weka when using a supervised filter, use FilteredClassifier (in the meta category) and configure it with the filter (e.g. SMOTE) and classifier (e.g. Naive Bayes) that you want to use.

For each cross-validation fold Weka will use only that fold's training data to parameterise the filter.

When you do this with SMOTE you won't see a difference in the number of instances in the Weka results window. What's happening is that Weka builds each model on the SMOTE-applied training data but evaluates it on the unfiltered held-out data, which is what you want for understanding the real performance. Try changing the SMOTE filter settings (e.g. the -P setting, which controls how many additional minority-class instances are generated, as a percentage of the number of existing minority-class instances) and you should see the performance change, showing you that the filter is actually doing something.
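If you want the same per-fold behaviour outside Weka, here is a minimal Python sketch of the idea using scikit-learn and imbalanced-learn (the library choice, toy dataset and parameter values are my own illustration, not part of Weka's FilteredClassifier): an imblearn `Pipeline` re-fits SMOTE on the training portion of each fold only, so the held-out fold is never oversampled.

```python
# Minimal sketch (assumed setup): per-fold SMOTE via an imbalanced-learn
# Pipeline, a Python analogue of the FilteredClassifier idea described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Toy imbalanced dataset standing in for the real data (90% / 10% classes).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# The pipeline applies SMOTE only when fitting on each training fold;
# the held-out fold is scored untouched, so the estimate is not inflated.
# (SMOTE's sampling_strategy parameter controls how much to oversample,
# loosely analogous to Weka's -P percentage.)
model = Pipeline([
    ("smote", SMOTE(random_state=1)),
    ("nb", GaussianNB()),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Mean 10-fold AUC:", scores.mean())
```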

The use of FilteredClassifier is illustrated in this video and these slides from the More Data Mining with Weka online course. In this example the filtering operation is supervised discretisation, not SMOTE, but the same principle applies to any supervised filter.

If you have further questions about the SMOTE technique I suggest asking them on Cross Validated and/or the Weka mailing list.

nekomatic
  • If I understand you correctly the procedure is: 1. Apply SMOTE in the 'Preprocess' section of the Weka GUI 2. Go to FilteredClassifier (meta section) in the 'Classify' section of the Weka GUI 3. In the FilteredClassifier choose the classifier I want, 'Naive Bayes' 4. What should I select for the 'filter' parameter in FilteredClassifier? – J Cena Mar 28 '18 at 02:05
  • No, don't apply anything in the `Preprocess` pane (because that will use the whole dataset and therefore bias the performance result). Configure your `FilteredClassifier` with `Naive Bayes` as the classifier and `SMOTE` as the filter. – nekomatic Mar 28 '18 at 08:10
  • Thanks a lot. With my configuration of FilteredClassifier with Naive Bayes as the classifier and SMOTE as the filter, can I use 10-fold CV? – J Cena Mar 28 '18 at 14:20
  • Yes, that's exactly the idea. See my edit for reference material - I recommend working through the [free online Weka courses](https://weka.waikato.ac.nz/explorer) if you're in any doubt about how to use the program and the techniques. – nekomatic Mar 28 '18 at 15:37
  • Hi, I am performing my experiments as you said: 1) Load my dataset 2) In the 'Classify' section select FilteredClassifier with `Naive Bayes` as the classifier and `SMOTE` as the filter 3) Select the `10-fold cross validation` option. However, in the results window, **the number of instances** has not changed even though I am using SMOTE to increase the number of instances. Please let me know why this happens. Are there any other parameters that I need to change in Weka? – J Cena Apr 09 '18 at 12:37
  • I can't quickly find a reference but I'm pretty sure that what's happening is that Weka is building the model on the SMOTE-applied dataset, but showing the output of evaluating it on the unfiltered held-out data - which makes sense in terms of understanding the real performance. Try changing the SMOTE filter settings (e.g. the `-P` percentage setting) and you should see the performance changing, showing you that SMOTE is actually doing something. – nekomatic Apr 09 '18 at 15:11
  • Thanks a lot. When I change the -P value the results vary, which implies that SMOTE is involved in the cross-validation :) – J Cena Apr 09 '18 at 15:38
  • I would like to balance my dataset (50%-50% for the two classes). Is it the percentage setting that we need to alter to do this? :) – J Cena Apr 11 '18 at 04:31
  • Yes, I think -P is the percentage by which to boost the number of samples in the minority class - i.e. 0 would not add any samples, 100 would double the number of samples, etc. - so you just need to calculate the appropriate value to use. – nekomatic Apr 11 '18 at 15:38
  • Thanks a lot for your valuable comment. To balance my dataset I needed to use 1300 as my -P value. Is that a bad -P value (because it is a very big number)? – J Cena Apr 12 '18 at 00:53
  • I'm not an expert on SMOTE so I suggest you take that question to the Cross Validated SE site, and/or the Weka mailing list, which is usually helpful (see edited answer for links). The important thing is whether you get acceptable performance on your real data, though. – nekomatic Apr 12 '18 at 10:06
  • Thank you very much for the information :) Sure, I will post a question there. – J Cena Apr 12 '18 at 12:38
0

The correct approach is to split the data into folds first and then apply sampling only to the training data in each fold, leaving the validation data as it is. The image below shows how the dataset should be resampled in a K-fold fashion.

Upsampling/Downsampling with K-Fold Cross validation

If you want to achieve this in Python, there is a library for it: https://pypi.org/project/k-fold-imblearn/
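For reference, here is a minimal hand-rolled sketch of the same procedure using scikit-learn and imbalanced-learn directly (the library choice and toy data are illustrative assumptions, not taken from the k-fold-imblearn package): split first, oversample only the training part of each fold, and evaluate on the untouched validation part.

```python
# Sketch (assumed setup): SMOTE applied to the training fold only,
# validation fold left as-is, as described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset standing in for the real data.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    # Oversample only the training portion of this fold.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    # Evaluate on the untouched validation portion.
    scores.append(f1_score(y[val_idx], clf.predict(X[val_idx])))

print("Mean 10-fold F1:", np.mean(scores))
```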