
I have decided to use Sklearn's Pipeline class to ensure that my model is not prone to data leakage.

However, my multi-class classification dataset (3 classes) is extremely imbalanced, so I need to implement dataset rebalancing. I have researched this, but I cannot find an answer as to when and how the rebalancing step should be done. Should it be done before scaling or after? Should it be done before the train/test split or after?

For simplicity's sake, I will not be using SMOTE, but rather random minority upsampling. Any answer would be greatly appreciated.

My code is as follows:

#All necessary packages have already been imported 

x = df[['MACD', 'MFI', 'ROC', 'RSI', 'Ultimate Oscillator', 'Williams %R', 'Awesome Oscillator', 'KAMA', 
        'Stochastic Oscillator', 'TSI', 'Volume Accumulator', 'ADI', 'CMF', 'EoM', 'FI', 'VPT', 'ADX', 'ADX Negative', 
        'ADX Positive', 'EMA', 'CRA']]

y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

pipe = Pipeline([('sc', StandardScaler()), 
                 ('svc', SVC(decision_function_shape = 'ovr'))])

candidate_parameters = [{'svc__C': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3], 
                         'svc__gamma': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3], 
                         'svc__kernel': ['poly']}]

clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)

clf.fit(X_train, y_train)
Johnny

1 Answer


You need to do the rebalancing after the train/test split. In the real world you do not know what your test set will look like, so it is better to keep it in its original distribution. Rebalance only the training set so the model can learn better, and then evaluate on the original, untouched test set. (If you use a validation set, keep that in its original distribution as well.)
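A minimal sketch of what that looks like with the pipeline from the question, using sklearn.utils.resample for the random minority upsampling (the helper variables introduced here, such as train_balanced, are illustrative, not part of the original code):

#Rebalance the training set only; X_test / y_test stay untouched
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis = 1)
majority_size = train['Label'].value_counts().max()

#Upsample each minority class with replacement to the size of the largest class
balanced_parts = []
for label, group in train.groupby('Label'):
    if len(group) < majority_size:
        group = resample(group, replace = True, n_samples = majority_size, random_state = 0)
    balanced_parts.append(group)

train_balanced = pd.concat(balanced_parts)
X_train_bal = train_balanced.drop(columns = 'Label')
y_train_bal = train_balanced['Label']

#Fit the grid search on the rebalanced training data; scaling still happens inside the pipeline
clf.fit(X_train_bal, y_train_bal)

#Evaluate on the original, imbalanced test set
print(clf.score(X_test, y_test))

One caveat: because the upsampling happens before GridSearchCV's internal 5-fold split, duplicated minority rows can land in both the training and validation folds. If you want the validation folds kept original as well, imbalanced-learn's Pipeline with RandomOverSampler resamples inside each fold instead.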

Batuhan B