I have an imbalanced classification problem. First, I want to scale the data and then resample it (the code below uses TomekLinks). To prevent data leakage I used a pipeline. My code is:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.under_sampling import TomekLinks
from imblearn.pipeline import make_pipeline  # imblearn's pipeline is required for samplers like TomekLinks

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)
class_weights = {0: 0.33421052631578946, 1: 0.6657894736842105}  # renamed to avoid shadowing the built-in dict
score = {'AUC': 'roc_auc',
         'RECALL': 'recall',
         'PRECISION': 'precision',
         'F1': 'f1',
         'ACC': 'accuracy',
         'BACC': 'balanced_accuracy'}
params = [{'randomforestclassifier__n_estimators': [50, 100, 200, 250, 300, 350],
           'randomforestclassifier__max_features': ['sqrt', 'log2', 0.8],  # 'auto' is deprecated/removed in recent scikit-learn
           'randomforestclassifier__max_depth': [1, 5, 10, 20, 30, 40, 50],
           'randomforestclassifier__min_samples_leaf': [1, 2, 4, 5, 10, 20],
           'randomforestclassifier__min_samples_split': [0.1, 0.5, 2, 5, 10, 12]}]  # integer values must be >= 2
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)  # random_state only takes effect with shuffle=True
pipeline = make_pipeline(RobustScaler(), TomekLinks(),
                         RandomForestClassifier(random_state=13, class_weight=class_weights))
# grid search
gcv_rf2 = GridSearchCV(estimator=pipeline, param_grid=params,
                       cv=skfold, scoring=score, n_jobs=12,
                       refit='F1', verbose=1,
                       return_train_score=True)
gcv_rf2.fit(X_train, y_train)
y_hat = gcv_rf2.predict(X_test)
print(classification_report(y_test, y_hat))
The problem is that the results for the positive class are not good. I suspect this is because an unscaled version of X_test is used for prediction (I know the test data should not be resampled, but I'm not sure about scaling). Is my code correct, or is there a problem in it that leads to these poor results?
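For reference, this is the check I plan to run to see whether scaling is actually applied at predict time. It is only a sketch based on my understanding of imblearn's Pipeline (samplers run during fit only, while transformers also run at predict); the step names are the lowercased class names that make_pipeline generates:

# Sketch: compare the pipeline's own predictions against a manual
# scale-then-predict path; TomekLinks should be skipped outside of fit.
best = gcv_rf2.best_estimator_
X_test_scaled = best.named_steps['robustscaler'].transform(X_test)
manual_pred = best.named_steps['randomforestclassifier'].predict(X_test_scaled)
print((manual_pred == best.predict(X_test)).all())  # I expect True if the pipeline scales X_test itself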