There are a number of options when dealing with imbalanced data.
1. You could use a weighting mechanism, whereby errors on the minor class are penalised more heavily.
From my own experience, SVMs (support vector machines) and XGBoost models allow weights to be adjusted so that errors on the minor class are penalised more heavily.
For instance, if generating classification predictions using an SVM, then class_weight can be set to 'balanced' as below. This weights each class inversely proportionally to its frequency, so that errors on the minor class are penalised more heavily:
from sklearn import svm
# 'balanced' weights each class inversely to its frequency
model = svm.SVC(gamma='scale',
                class_weight='balanced')
model.fit(x1_train, y1_train)
predictions = model.predict(x1_val)
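To see the weights that 'balanced' implies, they can be computed directly with scikit-learn's compute_class_weight utility. A minimal sketch, assuming y1_train is an array of class labels:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
# each class weight = n_samples / (n_classes * count of that class)
classes = np.unique(y1_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y1_train)
print(dict(zip(classes, weights)))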
For XGBoost, the scale_pos_weight parameter can be set to an appropriate value so as to penalise errors on the minor class more heavily. The higher the value, the greater the weight assigned to the positive class - which assumes the minor class is coded as 1.
import xgboost as xgb
# scale_pos_weight > 1 increases the penalty on positive-class errors
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100,
                              scale_pos_weight=3)
xgb_model.fit(x1_train, y1_train)
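Rather than picking the value by hand, a common starting point (suggested in the XGBoost documentation) is the ratio of negative to positive examples. A sketch, assuming y1_train is a NumPy array of 0/1 labels:
import numpy as np
# heuristic: sum(negative instances) / sum(positive instances)
ratio = np.sum(y1_train == 0) / np.sum(y1_train == 1)
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100,
                              scale_pos_weight=ratio)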
2. You could oversample the minor class. A technique such as SMOTE from the imblearn library can be used:
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
x1_train, y1_train = oversample.fit_resample(x1_train, y1_train)
This technique generates synthetic samples by interpolating between existing observations of the minor class, so that the number of observations in each class is equal.
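To confirm that the resampled classes are balanced, the label counts can be inspected (the counts in the comment are illustrative):
from collections import Counter
# after SMOTE both classes should have the same count, e.g. Counter({0: 900, 1: 900})
print(Counter(y1_train))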
As for which technique to use - my recommendation would be to assess which one performs best when comparing the predictions to the test data, as in the sketch below. However, I would add a caveat: accuracy readings should be analysed with scepticism.
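A per-class report makes this comparison concrete (assuming each candidate model's predictions are scored against the same validation labels y1_val):
from sklearn.metrics import classification_report
# reports precision, recall and F1 for each class separately,
# which is more informative than overall accuracy on imbalanced data
print(classification_report(y1_val, predictions))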
Accuracy vs. Precision vs. Recall
Let's take this example. We build a model on a dataset where the major class accounts for 90% of observations and the minor class for 10%. The model shows 90% accuracy when predicting against a test set.
However, there is a problem. The model fails to correctly classify any of the minor class observations in the test set - a model that simply predicted the major class every time would score exactly the same. Thus, the model does very well at predicting the major class but very poorly at predicting the minor class.
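This baseline is easy to reproduce with scikit-learn's DummyClassifier, which can be configured to always predict the most frequent class. A sketch on the hypothetical 90/10 split above:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
# always predicts the majority class, ignoring the features entirely
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x1_train, y1_train)
baseline_preds = baseline.predict(x1_val)
print(accuracy_score(y1_val, baseline_preds))  # ~0.90 on a 90/10 split
print(recall_score(y1_val, baseline_preds))    # 0.0 - no minor class observations caught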
In this regard, you should also note the readings of precision (the share of predicted positives that are truly positive, which penalises false positives) and recall (the share of actual positives that the model identifies, which penalises false negatives). As an example, let us say a company wants to predict customers that cancel their subscription to a product (1 = cancel, 0 = do not cancel). 90% of customers do not cancel, but 10% do.
In this instance - because a missed cancellation is a false negative we want to minimise - we are looking for a high recall score. In this regard, a model with 60% overall accuracy but 90% recall would be preferable to a model with 90% accuracy but only 10% recall.
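Both metrics can be computed directly for the cancellation example (assuming binary predictions scored against y1_val, with 1 marking a cancellation):
from sklearn.metrics import precision_score, recall_score
# pos_label=1 treats 'cancel' as the positive class
print(precision_score(y1_val, predictions, pos_label=1))
print(recall_score(y1_val, predictions, pos_label=1))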