I have a highly imbalanced binary (yes/no) classification dataset. Currently only about 0.008% of the rows are 'yes'.
I need to balance the dataset using SMOTE.
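I came across two methods to deal with the imbalance. Both are applied after I have run MinMaxScaler on the variables. The first combines SMOTE oversampling with random undersampling:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline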
oversample = SMOTE(sampling_strategy = 0.1, random_state=42)
undersample = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
steps = [('o', oversample), ('u', undersample)]
pipeline = Pipeline(steps=steps)
x_scaled_s, y_s = pipeline.fit_resample(X_scaled, y)
This reduces the dataset from 2.4 million rows to 732,000 rows, and the share of 'yes' rows rises from 0.008% to 33.33%.
The second approach is plain SMOTE with its default settings:
sm = SMOTE(random_state=42)
# fit_sample was deprecated in favour of fit_resample
X_sm, y_sm = sm.fit_resample(X_scaled, y)
This increases the number of rows from 2.4 million to 4.8 million, and the share of 'yes' rows is now 50%.
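As a sanity check on both sets of figures, here is the arithmetic worked out by hand. It is only a sketch: it assumes exactly 2.4 million rows and exactly 0.008% 'yes' (both are approximations on my side, so it will not reproduce 732,000 exactly), and it assumes a float sampling_strategy is the target minority/majority ratio after resampling.

```python
# Assumed starting point (approximate figures from above).
n_total = 2_400_000
n_yes = round(n_total * 0.008 / 100)   # about 192 'yes' rows
n_no = n_total - n_yes

# Approach 1: SMOTE to a 0.1 ratio, then undersample to a 0.5 ratio.
yes_1 = n_no // 10        # SMOTE grows 'yes' to 0.1 * 'no'
no_1 = yes_1 * 2          # undersampling shrinks 'no' to 'yes' / 0.5
total_1 = yes_1 + no_1
share_1 = yes_1 / total_1

# Approach 2: default SMOTE grows 'yes' until it matches 'no'.
yes_2 = n_no
total_2 = yes_2 + n_no
share_2 = yes_2 / total_2

print(total_1, round(share_1 * 100, 2))  # roughly 720,000 rows, one third 'yes'
print(total_2, round(share_2 * 100, 2))  # roughly 4.8 million rows, half 'yes'
```

In both cases almost all of the 'yes' rows in the result are synthetic, because there are only a couple of hundred real ones to start from.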
After these steps I need to split the data into train and test datasets.
What is the right approach here?
What factors do I need to consider before I choose any of these methods?
Should I evaluate on X_test, y_test taken from the unsampled data? That would mean splitting the data first and doing the upsampling/undersampling only on the train data.
Thank you.
JD