I currently have an imbalanced dataset of over 800,000 datapoints. The imbalance is severe: only 3,719 datapoints belong to one of the two classes. After undersampling the data with the NearMiss algorithm in Python and applying a Random Forest classifier, I am able to achieve the following results:
- Accuracy: 81.4%
- Precision: 82.6%
- Recall: 79.4%
- Specificity: 83.4%
However, when I re-test the same model on the full dataset, the confusion matrix shows a heavy bias towards the minority class, with a large number of false positives. Is this the correct way to test the model after undersampling?
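For reference, here is a minimal sketch of my pipeline. To keep it self-contained I use a synthetic imbalanced dataset from `make_classification` and simple random undersampling in place of NearMiss (in my real code I use `imblearn.under_sampling.NearMiss`); the evaluation step at the end is the part I'm unsure about:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Toy stand-in for my dataset: ~0.5% minority class.
X, y = make_classification(n_samples=20000, weights=[0.995, 0.005],
                           random_state=0)

# Undersample the majority class down to the minority count
# (random undersampling here; NearMiss in my actual code).
rng = np.random.default_rng(0)
min_idx = np.flatnonzero(y == 1)
maj_idx = rng.choice(np.flatnonzero(y == 0), size=min_idx.size,
                     replace=False)
bal = np.concatenate([min_idx, maj_idx])

# Train on the balanced subset.
clf = RandomForestClassifier(random_state=0).fit(X[bal], y[bal])

# Re-test on the FULL (imbalanced) dataset -- this is the step
# whose results look biased towards the minority class.
cm = confusion_matrix(y, clf.predict(X))
print(cm)
```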