My dataset has around 150k records with four labels, ['A','B','C','D'], distributed as follows:
A: 60000
B: 50000
C: 36000
D: 4000

When I use classification_report to get the precision, recall, and F1-score, I get an UndefinedMetricWarning for the F1-score because class D is never predicted, presumably due to its low number of records.
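To make the warning concrete, here is a minimal sketch in plain Python of why the metric is undefined. The labels are made-up toy values and per_class_precision is a hypothetical helper, not part of any library: when a class is never predicted, precision is 0/0, and F1 inherits the problem.

```python
def per_class_precision(y_true, y_pred, label):
    """Return precision for `label`, or None when the model never predicts it."""
    predicted = sum(1 for p in y_pred if p == label)
    if predicted == 0:
        return None  # TP / (TP + FP) would be 0 / 0, so the metric is undefined
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return true_pos / predicted

# Toy labels (hypothetical): the model never outputs 'D'.
y_true = ["A", "A", "B", "C", "D", "D"]
y_pred = ["A", "A", "B", "C", "A", "B"]

precision_by_class = {c: per_class_precision(y_true, y_pred, c) for c in "ABCD"}
```

As far as I understand, scikit-learn reports 0.0 for such a class and emits the warning instead of returning None; the zero_division parameter of classification_report controls that substitute value.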

I know that I need to oversample or undersample to fix the imbalance.

Question: Would it be a good idea to fix the imbalance by randomly sampling 4000 records from each class so that the classes are balanced?
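For reference, a rough sketch of what that undersampling would look like, using only the class counts above. The function name undersample_to_minority and the fixed seed are illustrative assumptions; only labels are modeled here, since the features are not shown.

```python
import random
from collections import Counter

def undersample_to_minority(labels, seed=0):
    """Return sorted indices that keep a random sample of each class,
    sized to the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))  # sample without replacement
    return sorted(kept)

# Class counts from the question; the labels stand in for the real records.
counts = {"A": 60000, "B": 50000, "C": 36000, "D": 4000}
labels = [label for label, n in counts.items() for _ in range(n)]

kept = undersample_to_minority(labels)
balanced = Counter(labels[i] for i in kept)   # 4000 records per class
discarded = len(labels) - len(kept)           # records thrown away
```

With these counts, the balanced set has 16000 records and 134000 of the original 150000 are discarded.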

mathgeek

1 Answer

I think you want to oversample your class D. One technique for this is the Synthetic Minority Oversampling Technique, or SMOTE. The article linked below describes the idea:

One way to solve this problem is to oversample the examples in the minority class. This can be achieved by simply duplicating examples from the minority class in the training dataset prior to fitting a model. This can balance the class distribution but does not provide any additional information to the model.

An improvement on duplicating examples from the minority class is to synthesize new examples from the minority class. This is a type of data augmentation for tabular data and can be very effective.

Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
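To illustrate the "synthesize new examples" idea, here is a simplified sketch of the interpolation step at the core of SMOTE, in plain Python. It is not the reference implementation: smote_sketch, its parameters, and the 2-D points are all hypothetical, and it assumes numeric features and at least two minority samples.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical 2-D minority points standing in for class D.
class_d = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.3, 1.1)]
new_points = smote_sketch(class_d, n_new=10, k=3)
```

In practice you would more likely use a library implementation, such as the SMOTE class in the imbalanced-learn package, and apply it to the training split only so that the evaluation data contains real records.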

user212514
  • Would it be reasonable to use a mix of undersampling and oversampling? For example, oversample class D while undersampling the remaining classes? – mathgeek May 17 '21 at 04:54
  • I would want to use as much real data as possible, so I would keep as much of A, B, and C as I can. By using SMOTE, you can reasonably keep the full sets from A, B, and C while still having enough real and fabricated data from D. You might consider looking at how others do transaction fraud classification, because there are frequently small numbers of examples of fraud. – user212514 May 17 '21 at 04:56
  • Great suggestion. I will take a look at some examples related to transaction fraud classification. Thanks for your guidance! – mathgeek May 17 '21 at 04:59