I am using an SVM and my dataset is imbalanced. The resulting model classifies 99% of the samples as Class 0 and only 1% as Class 1. Is there any way to correctly classify an imbalanced dataset using an SVM?
2 Answers
There are many ways to work with an imbalanced dataset. I have most commonly used a couple of these:
- Penalize wrong outputs: if class A has far fewer samples than class B, you can increase the penalty incurred for misclassifying class A (see the sketch after this list).
- Use the SMOTE module: it takes the convex combination of two points in a given class and assigns the new synthetic point the same label as the two chosen points.
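As a rough sketch of both ideas, assuming scikit-learn and imbalanced-learn are available (X and y below are made-up placeholder data, not from the question):

import numpy as np
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

# Placeholder data: 200 majority (class 0) and 20 minority (class 1) samples.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)

# Option 1: penalize mistakes on the rare class more heavily.
# class_weight='balanced' scales C inversely to the class frequencies;
# an explicit dict such as {0: 1, 1: 10} also works.
svm_weighted = SVC(kernel='rbf', class_weight='balanced')
svm_weighted.fit(X, y)

# Option 2: oversample the minority class with SMOTE, then fit a plain SVM.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
svm_smote = SVC(kernel='rbf')
svm_smote.fit(X_res, y_res)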
Other options include looking at different evaluation metrics and at validation strategies such as stratified k-fold.
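For instance, a small sketch of stratified cross-validation with an imbalance-aware score (reusing the placeholder svm_weighted, X and y from above):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the 0/1 ratio the same in every split, and F1
# (or recall / balanced accuracy) is more informative than plain
# accuracy when one class dominates.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(svm_weighted, X, y, cv=cv, scoring='f1')
print(scores.mean())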

There are several ways to adapt an unbalanced dataset for regression/classification. Here I'm going to describe the oversampling and undersampling methods.
In oversampling, you duplicate rows from the minority class, even though that means some rows in your data end up exactly the same. In undersampling, you keep all the samples with class 1 and pick the same number of samples with label 0 (this is only a good option if you have a large number of samples).
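As a minimal illustration of the two on their own (NumPy only; y, idx_0 and idx_1 are made-up placeholders):

import numpy as np

# Hypothetical labels; idx_0 / idx_1 are the row indices of each class.
y = np.array([0] * 95 + [1] * 5)
idx_0 = np.where(y == 0)[0]
idx_1 = np.where(y == 1)[0]

# Oversampling: draw minority indices with replacement until both classes match.
idx_1_over = np.random.choice(idx_1, len(idx_0), replace=True)
balanced_over = np.concatenate([idx_0, idx_1_over])

# Undersampling: keep every minority sample, subsample the majority class.
idx_0_under = np.random.choice(idx_0, len(idx_1), replace=False)
balanced_under = np.concatenate([idx_0_under, idx_1])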
You could also use a mix of the two. Something like:
import numpy as np

def obtain_equal_idx(idx_0, idx_1, n_samples, ratio_unbalance):
    # Oversample the minority class: tile its indices until there are
    # at least n_samples of them to draw from.
    idx_1_repeated = np.repeat(idx_1, (n_samples // len(idx_1)) + 1)
    # Undersample the majority class, keeping ratio_unbalance times as
    # many class-0 indices as class-1 indices.
    idx_0s = np.random.choice(idx_0, ratio_unbalance * (n_samples // 2), replace=False)
    idx_1s = np.random.choice(idx_1_repeated, n_samples // 2, replace=False)
    return np.concatenate([idx_0s, idx_1s])
Here idx_0 holds the indices of all rows labeled 0, idx_1 the indices of the rows labeled 1, n_samples is the number of samples you want to get back, and ratio_unbalance is a small factor (usually 2 or 3) that keeps the returned sample slightly unbalanced, so your model still sees that class 0 is more common than class 1.
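A possible way to call it, with X and y as hypothetical placeholders for your feature matrix and label vector:

import numpy as np

# Hypothetical data: 1,000 rows, heavily skewed toward class 0.
y = np.array([0] * 950 + [1] * 50)
X = np.random.randn(len(y), 4)

idx_0 = np.where(y == 0)[0]
idx_1 = np.where(y == 1)[0]

# Draw roughly 300 rows with a 2:1 majority/minority ratio,
# then train the SVM on the resampled subset X_res, y_res.
idx = obtain_equal_idx(idx_0, idx_1, n_samples=200, ratio_unbalance=2)
X_res, y_res = X[idx], y[idx]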
