
I am building a predictive model to predict whether a client will subscribe again or not. I already have the dataset, and the problem is that it is imbalanced (there are many more NOs than YESs). I believe my model is biased, but when I check the accuracy of the predictions on the training set and the test set, the two are really close (0.8879 on the training set and 0.8868 on the test set). What confuses me is: if my model is biased, why are the training and test accuracies so close? Or is my model not biased?

Yassire
  • What's the actual NO-to-YES ratio? Also, accuracy is a poor metric for imbalanced classes; depending on your goal, you need to evaluate with a different metric. That aside, I don't really think this question fits on SO; it seems more theoretical than programming related and is better suited for the Stats or Machine Learning Stack Exchange. – Paritosh Singh Mar 16 '20 at 17:11
  • NO: 36548, YES: 4640 – Yassire Mar 16 '20 at 17:15
  • You can read this: https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/ – Paritosh Singh Mar 16 '20 at 17:18
  • Imbalanced dataset and biased model are two different matters; the former is a property of the dataset while the latter concerns the learning algorithm and how it has been trained. – Reveille Mar 16 '20 at 18:14

1 Answer


Quick response: yes, your model is very likely predicting everything as the majority class. In fact, with 36548 NO and 4640 YES, a model that always answers NO already scores 36548/41188 ≈ 0.887 accuracy, which is essentially what you report on both the training and the test set.
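As a quick sanity check, here is a minimal sketch (the label array below is rebuilt from the counts in the comments, not from your real data) of what an "always NO" predictor looks like under accuracy versus a confusion matrix and per-class recall:

```python
# Synthetic stand-in for the dataset: 36548 NO (0) and 4640 YES (1).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

n_no, n_yes = 36548, 4640
y_true = np.array([0] * n_no + [1] * n_yes)   # 0 = NO, 1 = YES
y_pred = np.zeros_like(y_true)                # an "always NO" predictor

print(accuracy_score(y_true, y_pred))         # ~0.8874, right between the reported 0.8868 and 0.8879
print(confusion_matrix(y_true, y_pred))       # all 4640 YES clients are misclassified
print(classification_report(y_true, y_pred, target_names=["NO", "YES"], zero_division=0))
# Recall for YES is 0.0 even though the overall accuracy looks good.
```

If your model's confusion matrix on the test set looks similar, i.e. it makes (almost) no YES predictions, then the close train/test accuracies only mean that both sets have the same class ratio, not that the model has learned anything about the YES class.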

Let's think of it in a simpler way. You have an optimizer in the training process, which tries to maximize accuracy (i.e., minimize misclassification). Suppose you have a training set of 1000 images, only 10 of which are tigers, and you want to learn a classifier to distinguish tigers from non-tigers.

What the optimizer is very likely to do is always predict non-tiger, for every single image. Why? Because that is a much simpler model, easier to reach in the search space, and it already gets 99% accuracy!
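As an illustration (a sketch only, with random noise as features just to satisfy the API), scikit-learn's DummyClassifier with strategy="most_frequent" is literally the "always predict non-tiger" model the optimizer tends to collapse into:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # made-up features, only needed to call fit/predict
y = np.array([1] * 10 + [0] * 990)               # 1 = tiger, 0 = non-tiger

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)                          # predicts non-tiger for every image

print(accuracy_score(y, y_pred))                 # 0.99 -- looks great
print(recall_score(y, y_pred, zero_division=0))  # 0.0  -- not a single tiger detected
```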

I suggest you read more about imbalanced data problems (this one seems to be a good place to start: https://machinelearningmastery.com/what-is-imbalanced-classification/). Depending on the problem you are solving, you might want to try down-sampling or over-sampling, or more advanced solutions such as changing the loss function, using metrics like F1 or AUC, and/or doing ranking instead of classification.
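For illustration, here is a minimal, self-contained sketch of two of these remedies, class weighting during training plus F1/ROC-AUC for evaluation, on a synthetic dataset that only mimics your NO/YES imbalance (roughly 88.7% / 11.3%), not the real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, classification_report

# Synthetic data with the same overall size and class ratio as the real dataset.
X, y = make_classification(n_samples=41188, weights=[0.887], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)   # stratify preserves the class ratio

# class_weight="balanced" up-weights the rare YES class in the loss function.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred))                          # balances precision and recall on the rare class
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))  # threshold-independent ranking quality
print(classification_report(y_test, y_pred))
```

Class weighting is often the cheapest thing to try before moving to explicit over-/under-sampling (e.g. with the imbalanced-learn package), since it needs no changes to the data itself.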

alift