I am trying to solve the following problem:

- an imbalanced dataset with 2 classes
- one class dwarfs the other (923 vs 38)
- the f1_macro score stays in the 0.6 - 0.65 range for both TRAIN and TEST when a RandomForestClassifier is trained on the dataset as-is
While researching the topic yesterday, I read up on resampling, and on the SMOTE algorithm in particular. It seems to have worked wonders for my TRAIN score: after balancing the dataset with SMOTE, the score went from ~0.6 up to ~0.97. I applied it as follows:

- I split off my TEST set from the rest of the data at the very beginning (10% of the whole data)
- I applied SMOTE to the TRAIN set only (class balance 618 vs 618)
- I trained a RandomForestClassifier on the TRAIN set and achieved f1_macro = 0.97
- when testing on the TEST set, the f1_macro score remained in the ~0.6 - 0.65 range

My assumption is that the holdout TEST set contained minority-class observations that were very different from the pre-SMOTE minority-class observations in the TRAIN set. As a result, the model learned to recognize the cases in the TRAIN set really well, but was thrown off by these few outliers in the TEST set.
What are the common strategies for dealing with this problem? Common sense dictates that I should try to capture a very representative sample of the minority class in the TRAIN set, but I do not think sklearn has any automated tools for doing that.
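
One sklearn facility that may be relevant to the "representative sample" idea is stratified splitting, which keeps the class proportions the same on both sides of a split. The sketch below uses only the class counts from this question (923 vs 38) with placeholder features, so it illustrates the mechanics and is not a claim that it fixes the TEST score. `stratify=` on `train_test_split` and `StratifiedKFold` are standard sklearn APIs; the data and the `n_estimators=50` setting are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Placeholder data (an assumption): same class counts as in the question.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (923, 5)), rng.normal(1.0, 1.0, (38, 5))])
y = np.array([0] * 923 + [1] * 38)

# A stratified 10% holdout keeps the minority share roughly equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
print("minority rows in TRAIN:", int(y_train.sum()))
print("minority rows in TEST:", int(y_test.sum()))

# Stratified k-fold scores the model on several different holdouts, which
# shows how much f1_macro depends on which few minority rows are held out.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_minority = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="f1_macro",
)
print("minority rows per fold:", fold_minority)
print("f1_macro per fold:", np.round(scores, 2))
```

With so few minority rows in a single 10% holdout, one TEST score is a noisy estimate, so the spread of the per-fold scores may be as informative as their mean.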