I am trying to solve an imbalanced classification problem; all the input features are categorical. Here are the value counts of each feature:

 # count the number of distinct values in each feature
 for i in X_train.columns:
     print(i + ':', X_train[i].value_counts().shape[0])

 Pclass: 3
 Sex: 2
 IsAlone: 2
 Title: 5
 IsCabin: 2
 AgeBin: 4
 FareBin: 4

I am applying SMOTE on the training data, after train_test_split. New values are created that are not present in the original X_train dataset.

 from imblearn.over_sampling import SMOTE
 from collections import Counter
 # removing the random_state doesn't help
 sm = SMOTE(random_state=0)
 X_res, y_res = sm.fit_resample(X_train, y_train)
 print('Resampled dataset shape %s' % Counter(y_res))

 Resampled dataset shape Counter({1: 381, 0: 381})

Value counts of the resampled dataset:

 Pclass: 16
 Sex: 7
 IsAlone: 2
 Title: 12
 IsCabin: 2
 AgeBin: 4
 FareBin: 4

New values are being created by SMOTE; the same happened with undersampling. These new values are not present in the test dataset.

Example (value counts of Pclass):

 X_train: 1: 20, 2: 15, 3: 40
 X_res:   1: 20, 0.999999: 3, 2: 11, 1.9999999: 4, 3: 34, 2.9999999: 6

My questions:

  1. Why are these values created, and do they hold some importance?

  2. How should I deal with them? Should I just round them off, or remove them?

  3. Is there a way to perform over- and under-sampling without creating these new values?


1 Answer


If the class distribution of the dataset is uneven, this may cause trouble in later phases of training and classification, as classifiers will have very little data from which to learn the features of the minority class.

Unlike plain random oversampling, SMOTE uses the nearest-neighbor algorithm to generate new, synthetic data points that can be used to train the models.

As stated in the original SMOTE paper, "The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors."
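
This interpolation is exactly why fractional values such as 1.9999999 appear: SMOTE treats your integer-coded categories as continuous numbers and picks points between a sample and its neighbor. A minimal sketch of that step (the function and variable names are illustrative, not imbalanced-learn's internals):

 import numpy as np

 def smote_interpolate(x, x_neighbor, rng):
     # pick a random point on the line segment joining a minority sample
     # and one of its k nearest minority-class neighbors
     lam = rng.random()  # random gap in [0, 1)
     return x + lam * (x_neighbor - x)

 rng = np.random.default_rng(0)
 x = np.array([3.0, 1.0])           # e.g. [Pclass, Sex] of one minority sample
 x_neighbor = np.array([2.0, 1.0])  # one of its nearest minority neighbors
 print(smote_interpolate(x, x_neighbor, rng))  # Pclass lands between 2 and 3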

So yes, these newly generated synthetic data points are important, and you do not have to worry about them much. SMOTE is one of the best techniques out there for this task, so I would suggest using it.
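
That said, since all of your features are categorical, it is worth noting that recent versions of imbalanced-learn (0.8 and later) also ship SMOTEN, a SMOTE variant for purely categorical data that builds synthetic samples from neighbors' existing category values instead of interpolating, so no fractional values appear (SMOTENC covers the mixed numeric/categorical case). A minimal sketch, assuming such a version is installed:

 from imblearn.over_sampling import SMOTEN
 from collections import Counter

 # SMOTEN treats every feature as categorical; synthetic samples reuse
 # existing category values, so no new (fractional) values are created
 sm = SMOTEN(random_state=0)
 X_res, y_res = sm.fit_resample(X_train, y_train)
 print('Resampled dataset shape %s' % Counter(y_res))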

Consider the following image for example:

 [Figure: scatter plots of a dataset before (a) and after (b) SMOTE]

Figure (a) has many data points for class 0 but very few for class 1.

As you can see, after applying SMOTE (Figure b), new data points are generated for the minority class (in this case, class 1) in order to balance the dataset.


Try reading:

  1. http://rikunert.com/SMOTE_explained

  2. https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

  • Hello @Harshil, I am wondering: do the new data points created by oversampling reduce the model's prediction quality when it is used for future predictions? – Dulangi_Kanchana Sep 03 '21 at 07:17