I am working on a multi-class classification problem with highly imbalanced data, meaning there is a huge gap between the most frequent class and the least frequent one. If I use SMOTE oversampling, the dataset grows enormously (from 280k rows to more than 25 billion rows, because the imbalance is so extreme) and it becomes practically impossible to fit an ML model to it. Undersampling is also out, since it would throw away too much information.
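To make the blow-up concrete, here is a rough back-of-the-envelope sketch (using hypothetical class counts, not my real data) of how SMOTE-style balancing scales the dataset: every minority class is upsampled to the majority-class count, so the resampled size is roughly n_classes * max_count.

```python
# Hypothetical class counts, used only to illustrate the scaling;
# my real data has far more classes and a much larger majority class.
class_counts = {'A': 250_000, 'B': 25_000, 'C': 100}

original_size = sum(class_counts.values())       # rows before resampling
max_count = max(class_counts.values())           # majority-class frequency
resampled_size = max_count * len(class_counts)   # every class upsampled to max

print(original_size, resampled_size)  # 275100 750000
```

With many classes and a very large majority class, that product explodes, which is exactly the problem I ran into.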
So I thought of using compute_class_weight from sklearn while creating the model.
Code:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(train_df['Label_id'])
class_weight = compute_class_weight(class_weight='balanced',
                                    classes=classes,
                                    y=train_df['Label_id'])
dict_weights = dict(zip(classes, class_weight))
svc_model = LinearSVC(class_weight=dict_weights)
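For reference, the 'balanced' heuristic that compute_class_weight implements is n_samples / (n_classes * bincount(y)), so rarer classes get proportionally larger weights. A tiny sketch with toy labels (not my data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: class 0 appears 4 times, class 1 only once.
y = np.array([0, 0, 0, 0, 1])

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y), y=y)

# Same formula by hand: n_samples / (n_classes * per-class count)
manual = len(y) / (len(np.unique(y)) * np.bincount(y))

print(weights)  # [0.625 2.5  ]
```

So the rare class here gets 4x the weight of the common one, which is what I expected to help the model.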
I made predictions on the test data and recorded metrics such as accuracy, f1_score, recall, etc.
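One thing worth noting when reporting these numbers on imbalanced multi-class data: the average= parameter of f1_score matters a lot. Weighted averaging is dominated by the majority classes, while macro averaging treats every class equally. A small sketch with made-up labels standing in for my real test split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels/predictions, only to show how the averages differ.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

acc = accuracy_score(y_true, y_pred)
f1_weighted = f1_score(y_true, y_pred, average='weighted')
f1_macro = f1_score(y_true, y_pred, average='macro')

print(acc, f1_weighted, f1_macro)
```

If the macro scores diverge from the weighted ones, the model is doing noticeably worse on the rare classes even when the headline numbers look fine.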
Then I replicated the same experiment but without passing class_weight, like this:
svc_model = LinearSVC()
But the results I obtained were strange: the metrics with class_weight were slightly worse than those without it. I was expecting the exact opposite, since I am using class_weight precisely to make the model, and hence the metrics, better. The difference between the two models was minimal, but f1_score was lower for the model with class_weight than for the model without it.
I also tried the snippet below:

svc_model = LinearSVC(class_weight='balanced')

but the f1_score was still lower than for the model without class_weight.
Below are the metrics I obtained:

LinearSVC w/o class_weight:
Accuracy: 89.02, F1 score: 88.92, Precision: 89.17, Recall: 89.02, Misclassification error: 10.98

LinearSVC with class_weight='balanced':
Accuracy: 87.98, F1 score: 87.89, Precision: 88.3, Recall: 87.98, Misclassification error: 12.02

LinearSVC with class_weight=dict_weights:
Accuracy: 87.97, F1 score: 87.87, Precision: 88.34, Recall: 87.97, Misclassification error: 12.03
I assumed that using class_weight would improve the metrics, but instead it is making them worse. Why is this happening, and what should I do? Would it be okay not to handle the imbalance at all?