
I have a problem I am trying to solve:

  • an imbalanced dataset with 2 classes

  • one class dwarfs the other (923 vs 38 observations)

  • the f1_macro score when the dataset is used as-is to train a RandomForestClassifier stays in the 0.6 - 0.65 range for both TRAIN and TEST

While doing research on the topic yesterday, I read up on resampling, and especially the SMOTE algorithm. It seems to have worked wonders for my TRAIN score: after balancing the dataset with it, my score went from ~0.6 up to ~0.97. The way I applied it was as follows (a code sketch follows the list):

  • I split my TEST set away from the rest of the data at the very beginning (10% of the whole data)

  • I applied SMOTE to the TRAIN set only (class balance 618 vs 618)

  • I trained a RandomForestClassifier on the TRAIN set and achieved f1_macro = 0.97

  • when testing on the TEST set, the f1_macro score remained in the ~0.6 - 0.65 range
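
For reference, a minimal sketch of the workflow above, assuming X and y are a NumPy feature matrix and label vector; the 10% hold-out and TRAIN-only SMOTE are as described, everything else (random_state, defaults) is illustrative:

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from imblearn.over_sampling import SMOTE

    # hold out 10% of the data as TEST before any resampling
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

    # oversample the minority class on the TRAIN portion only
    X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train_res, y_train_res)

    print("TRAIN f1_macro:", f1_score(y_train_res, clf.predict(X_train_res), average="macro"))
    print("TEST  f1_macro:", f1_score(y_test, clf.predict(X_test), average="macro"))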

What I assume happened is that the holdout observations in the TEST set were vastly different from the pre-SMOTE minority-class observations in the TRAIN set: the resampling taught the model to recognize the TRAIN-set cases really well, but the model was thrown off balance by these few unfamiliar examples in the TEST set.

What are the common strategies for dealing with this problem? Common sense would dictate that I should try to capture a very representative sample of the minority class in the TRAIN set, but I do not think sklearn has any automated tools that allow that to happen?

Greem666

1 Answer


Your assumption is correct. Your machine learning model is basically overfitting on your training data: the same pattern is repeated over and over for one class, so the model learns that pattern and misses the other patterns that appear in the test data. This means the model will not perform well in the wild.

If SMOTE is not working, you can experiment with different machine learning models. Random forest generally performs well on this type of dataset, so try tuning your RF model by pruning it or adjusting its hyperparameters. Another option is to assign class weights when training the model. You can also try penalized models, which impose an additional cost on the model when it misclassifies the minority class.
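
As a minimal sketch of the class-weighting and pruning idea (the specific hyperparameter values here are only placeholders, not a recommendation):

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(
        n_estimators=500,
        max_depth=8,              # shallower trees act as pruning / regularization
        min_samples_leaf=5,
        class_weight="balanced",  # weight classes inversely to their frequency
        random_state=42,
    )
    clf.fit(X_train, y_train)     # fit on the original, un-resampled TRAIN set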

You can also try undersampling, since you have already tested oversampling, although your undersampling will most probably suffer from the same problem. Please also try simple random oversampling instead of SMOTE to see how your results change.
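
A sketch of both with imbalanced-learn's plain random samplers, applied (as before) to the TRAIN set only:

    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # duplicate minority observations instead of synthesizing new ones
    X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

    # drop majority observations until the classes are balanced
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)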

Another, more advanced method that you should experiment with is batching. Take all of your minority class and an equal number of entries from the majority class and train a model. Keep doing this for all the batches of your majority class; in the end you will have multiple machine learning models, which you can then use together to vote.
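
A rough sketch of that batching idea, assuming NumPy arrays and that the minority label is 1; imbalanced-learn also ships ready-made variants of this approach (e.g. BalancedBaggingClassifier):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    min_idx = np.where(y_train == 1)[0]                 # minority class indices (assumed label 1)
    maj_idx = np.random.default_rng(42).permutation(np.where(y_train == 0)[0])

    models = []
    for chunk in np.array_split(maj_idx, len(maj_idx) // len(min_idx)):
        idx = np.concatenate([min_idx, chunk])          # one balanced batch
        models.append(RandomForestClassifier(random_state=42).fit(X_train[idx], y_train[idx]))

    # combine the per-batch models by majority vote
    votes = np.stack([m.predict(X_test) for m in models])
    y_pred = (votes.mean(axis=0) >= 0.5).astype(int)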

secretive
  • Hi Rajat, thank you for your comments. Can you let me know if I am doing the right thing by putting aside the TEST set from the main data even before resampling the minority class with SMOTE? Later on in the training effort, I do a grid search with CV, where the CV also splits its own VALIDATION set out of the TRAIN set. This way I always have some data left out in the TEST set, which can contain a pattern that is not present in the TRAIN or VALIDATION sets. – Greem666 May 14 '19 at 01:01
  • You have to leave data out for testing, otherwise you will never be able to monitor the performance of your model. And yes, there might be patterns in the test set, but that is what machine learning is for: it has to generalize and predict on sets that are a bit different and that the model has never seen, otherwise it is machine memorizing, not machine learning. There are several methods to split data in better proportions, because a random split might put 60% of the minority class into the test set, which is also not good, so look for these techniques (a stratified-split sketch is included below). – secretive May 14 '19 at 01:37
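
One such technique, as a small sketch assuming the same X and y as above, is a stratified split, which keeps the 923:38 class ratio roughly intact in both TRAIN and TEST instead of leaving it to chance:

    from sklearn.model_selection import train_test_split

    # stratify=y preserves the class proportions in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=42)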