Which metric to use for imbalanced classification problem?

Question

I am working on a classification problem with very imbalanced classes. I have 3 classes in my dataset: classes 0, 1 and 2. Class 0 is 11% of the training set, class 1 is 13% and class 2 is 75%.

I used a random forest classifier and got 76% accuracy, but I discovered that 93% of this accuracy comes from class 2 (the majority class). Here is the crosstab I got.

The results I would like to have :

fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1

What I found on the internet to solve the problem and what I've tried :

Using class_weight='balanced' or a custom class_weight (1/11% for class 0, 1/13% for class 1, 1/75% for class 2), but it doesn't change anything (the accuracy and crosstab are still the same). Do you have an interpretation/explanation of this?
As I know accuracy is not the best metric in this context, I used other metrics: precision_macro, precision_weighted, f1_macro and f1_weighted. I also implemented the area under the precision-recall curve for each class and used the average as a metric.

Here's my code (feedback welcome) :

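from sklearn.metrics import average_precision_score, make_scorer
from sklearn.preprocessing import label_binarize

def pr_auc_score(y_true, y_pred):
    # One-hot encode the labels so each class gets its own precision-recall curve.
    y = label_binarize(y_true, classes=[0, 1, 2])
    # Macro-average of the per-class average precision (area under each PR curve).
    return average_precision_score(y, y_pred)

# needs_proba=True passes predict_proba output to the metric instead of hard labels
# (newer scikit-learn releases use response_method="predict_proba" for this).
pr_auc = make_scorer(pr_auc_score, greater_is_better=True, needs_proba=True)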

and here's a plot of the precision vs recall curves.

Alas, for all these metrics, the crosstab remains the same; they seem to have no effect.

I also tuned the parameters of boosting algorithms (XGBoost and AdaBoost), with accuracy as the metric, and again the results did not improve. I don't understand, because boosting algorithms are supposed to handle imbalanced data.
Finally, I used another model (BalancedRandomForestClassifier) and the metric I used is accuracy. The results are good as we can see in this crosstab. I am happy to have such results but I notice that, when I change the metric for this model, there is again no change in the results...

So I'm really interested in knowing why using class_weight, changing the metric or using boosting algorithms, don't lead to better results...

Did you try XGBoost with an array of weights? I have used XGBoost for imbalanced binary classification, and setting scale_pos_weight improved the model's performance. As you have a multiclass problem, you cannot use scale_pos_weight unless you use a one-vs-rest approach, but you can pass an array of sample weights instead, and that should solve the problem. — Muhammad Hassan, Sep 09 '21 at 04:26

score 0 · Answer 1 · answered Sep 10 '21 at 06:53

As you have figured out, you have encountered the "accuracy paradox":

Say you have a classifier with an accuracy of 98%. That would be amazing, right? It might be, but if your data consists of 98% class 0 and 2% class 1, you obtain 98% accuracy by assigning all samples to class 0, which is clearly a bad classifier.
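To make the paradox concrete, here is a minimal sketch in plain Python. The always-predict-0 "classifier" and the per-class recall helper are just illustrations I made up (recall_per_class is not a library function); averaging the per-class recalls is what is usually called balanced accuracy.

```python
from collections import Counter

# 98% class 0 and 2% class 1, as in the example above.
y_true = [0] * 98 + [1] * 2
# A "classifier" that always predicts the majority class.
y_pred = [0] * 100

# Plain accuracy: fraction of samples predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in sorted(totals)}


recalls = recall_per_class(y_true, y_pred)
# Mean of the per-class recalls: every class counts equally, whatever its size.
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(accuracy)           # 0.98
print(recalls)            # {0: 1.0, 1: 0.0}
print(balanced_accuracy)  # 0.5
```

The accuracy looks great, but the recall for class 1 is zero, which is exactly what the single accuracy number hides.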

So, what should we do? We need a measure that is invariant to the class distribution of the data: enter ROC curves.

ROC curves are invariant to the class distribution of the data, so they are a great tool for visualizing a classifier's performance whether or not the data is imbalanced. But they only work for a two-class problem (you can extend them to multiclass by creating one-vs-rest or one-vs-one ROC curves).
The F-score might be a bit more "tricky" to use than ROC-AUC, since it is a trade-off between precision and recall and you need to set the beta parameter (which is often 1, hence the F1 score).
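A rough sketch of both ideas in plain Python, with made-up toy scores for one class in a one-vs-rest setup (1 = the class, 0 = the rest). The helper names roc_auc and f_beta are mine, not from a library. The AUC is computed as the probability that a random positive is scored above a random negative, which is equivalent to the area under the ROC curve.

```python
def roc_auc(scores, labels):
    """AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy scores (hypothetical): three positives, three negatives.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]

auc_balanced = roc_auc(pos_scores + neg_scores, [1] * 3 + [0] * 3)

# Repeat every negative ten times: the class ratio changes from 1:1 to 1:10,
# but the AUC stays the same.
auc_imbalanced = roc_auc(pos_scores + neg_scores * 10, [1] * 3 + [0] * 30)

print(auc_balanced, auc_imbalanced)

# With precision 0.5 and recall 1.0, a larger beta rewards the high recall more.
print(f_beta(0.5, 1.0, beta=1.0))
print(f_beta(0.5, 1.0, beta=2.0))
```

For the multiclass case you would repeat the AUC computation once per class (that class against the rest) and look at the per-class values or their average.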

You write: "fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1". Remember that all algorithms work by either minimizing or maximizing something; often we minimize a loss function of some sort. For a random forest, let's say we want to minimize the following function L:

L = (w0+w1+w2)/n

where wi is the number of samples from class i that are classified as not class i (i.e. if w0=13 we have misclassified 13 samples from class 0), and n is the total number of samples.

It is clear that when class 0 makes up most of the data, an easy way to get a small L is to classify most of the samples as 0. We can overcome this by adding a weight to each class instead, e.g.

L = (b0*w0+b1*w1+b2*w2)/n

As an example, say b0=1, b1=5, b2=10. Now you can see that we cannot just assign most of the data to class 0 without being punished by the weights, i.e. we are much more conservative about assigning samples to class 0, since assigning a class 1 sample to class 0 now gives us 5 times as much loss as before! This is exactly how the weights in most classifiers work: they assign a penalty/weight to each class (often inversely proportional to its share of the data, i.e. if class 0 makes up 80% and class 1 makes up 20% of the data, then b0=1 and b1=4), but you can often specify the weights yourself. If you find that the classifier still generates too many false negatives for a class, increase the penalty for that class.

Unfortunately, "there is no such thing as a free lunch", i.e. which metric to use is a problem-, data- and usage-specific choice.
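Here is a small sketch of the weighted loss with hypothetical numbers: 100 samples, of which 80 are class 0 and 10 each are classes 1 and 2. The two sets of misclassification counts are invented to illustrate the point, and the helper names are mine.

```python
def weighted_loss(misclassified, weights, n):
    """L = (b0*w0 + b1*w1 + b2*w2) / n for per-class error counts w and weights b."""
    return sum(b * w for b, w in zip(weights, misclassified)) / n


def inverse_frequency_weights(counts):
    """Weights inversely proportional to class size, scaled so the largest class gets 1."""
    largest = max(counts)
    return [largest / c for c in counts]


n = 100
# Classifier A predicts class 0 for everything: all of classes 1 and 2 are wrong.
errors_a = [0, 10, 10]
# Classifier B makes the same number of mistakes, but spread over all classes.
errors_b = [15, 3, 2]

unweighted = [1, 1, 1]
weights = [1, 5, 10]  # b0=1, b1=5, b2=10 as in the example above

print(weighted_loss(errors_a, unweighted, n))  # 0.2
print(weighted_loss(errors_b, unweighted, n))  # 0.2
print(weighted_loss(errors_a, weights, n))     # 1.5
print(weighted_loss(errors_b, weights, n))     # 0.5

# The 80% / 20% example from the text gives b0=1 and b1=4.
print(inverse_frequency_weights([80, 20]))     # [1.0, 4.0]
```

Without weights the two classifiers look identical; with weights, the one that ignores the minority classes is clearly worse.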

On a side note - "random forest" might actually be bad by design when you don't have much data due to how the splits are calculated (let me know, if you want to know why - it's rather easy to see when using e.g Gini as splitting). Since you have only provided us with the ratio for each class and not the numbers, I cannot tell.
