
I have decided to use Sklearn's Pipeline class to ensure that my model is not prone to data leakage.

However, my multi-class classification dataset (3 classes) is extremely imbalanced, so I need to implement dataset rebalancing. I have researched this, but I cannot find an answer as to when and how the rebalancing step should be done. Should it be done before scaling or after? Should it be done before the train/test split or after?

For simplicity's sake, I will not be using SMOTE, but rather random minority upsampling. Any answer would be greatly appreciated.

My code is as follows:

#All necessary packages have already been imported 

x = df[['MACD', 'MFI', 'ROC', 'RSI', 'Ultimate Oscillator', 'Williams %R', 'Awesome Oscillator', 'KAMA', 
        'Stochastic Oscillator', 'TSI', 'Volume Accumulator', 'ADI', 'CMF', 'EoM', 'FI', 'VPT', 'ADX', 'ADX Negative', 
        'ADX Positive', 'EMA', 'CRA']]

y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

pipe = Pipeline([('sc', StandardScaler()), 
                 ('svc', SVC(decision_function_shape = 'ovr'))])

candidate_parameters = [{'svc__C': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3], 
                         'svc__gamma': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3], 
                         'svc__kernel': ['poly']}]

clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)

clf.fit(X_train, y_train)
Johnny

1 Answer


You need to do the rebalancing after the train/test split. In the real world you do not know what your test set will look like, so it is better to keep it in its original distribution. Rebalance only the training set so the model can learn better, and then evaluate on the original, untouched test set. (If you use a validation set, keep that in its original distribution as well.)
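A minimal sketch of what that looks like with the pipeline from the question, using sklearn.utils.resample for the random minority upsampling (the helper variables introduced here, such as train_balanced, are illustrative, not part of the original code):

#Rebalance the training set only; X_test / y_test stay untouched
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis = 1)
majority_size = train['Label'].value_counts().max()

#Upsample each minority class with replacement to the size of the largest class
balanced_parts = []
for label, group in train.groupby('Label'):
    if len(group) < majority_size:
        group = resample(group, replace = True, n_samples = majority_size, random_state = 0)
    balanced_parts.append(group)

train_balanced = pd.concat(balanced_parts)
X_train_bal = train_balanced.drop(columns = 'Label')
y_train_bal = train_balanced['Label']

#Fit the grid search on the rebalanced training data; scaling still happens inside the pipeline
clf.fit(X_train_bal, y_train_bal)

#Evaluate on the original, imbalanced test set
print(clf.score(X_test, y_test))

One caveat: because the upsampling happens before GridSearchCV's internal 5-fold split, duplicated minority rows can land in both the training and validation folds. If you want the validation folds kept original as well, imbalanced-learn's Pipeline with RandomOverSampler resamples inside each fold instead.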

Batuhan B