
I have an imbalanced classification problem. First I want to scale the data and then resample it (the pipeline below uses TomekLinks rather than SMOTE). To prevent data leakage I used a pipeline. My code is:

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.pipeline import make_pipeline  # imblearn's pipeline knows how to handle samplers
from imblearn.under_sampling import TomekLinks

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)
class_weights = {0: 0.33421052631578946, 1: 0.6657894736842105}  # per-class weights (naming it dict would shadow the builtin)
score = {'AUC': 'roc_auc',
         'RECALL': 'recall',
         'PRECISION': 'precision',
         'F1': 'f1',
         'ACC': 'accuracy',
         'BACC': 'balanced_accuracy'}

params = [{'randomforestclassifier__n_estimators': [50, 100, 200, 250, 300, 350],
           'randomforestclassifier__max_features': ['sqrt', 'log2', 0.8],  # 'auto' is deprecated and equal to 'sqrt'
           'randomforestclassifier__max_depth': [1, 5, 10, 20, 30, 40, 50],
           'randomforestclassifier__min_samples_leaf': [1, 2, 4, 5, 10, 20],
           # min_samples_split must be an int >= 2 or a float fraction in (0, 1]
           'randomforestclassifier__min_samples_split': [0.1, 0.5, 2, 5, 10, 12]}]

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)  # random_state only takes effect with shuffle=True

pipeline = make_pipeline(RobustScaler(), TomekLinks(), RandomForestClassifier(random_state=13, class_weight=class_weights))

#grid search
gcv_rf2 = GridSearchCV(estimator=pipeline, param_grid=params,
                       cv=skfold, scoring=score, n_jobs=12,
                       refit='F1', verbose=1,
                       return_train_score=True)

gcv_rf2.fit(X_train, y_train)
y_hat = gcv_rf2.predict(X_test)

print(classification_report(y_test, y_hat))

The problem is that the results for the positive class are not good. I think it relates to predicting on an unscaled version of X_test (I know not to resample the test data, but I'm not sure about scaling). Is my code correct, or is there a problem with it that leads to this disappointing result?
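
As a side note on the scaling question: once an imbalanced-learn pipeline is fitted, predict pushes the input through the fitted RobustScaler automatically, while samplers such as TomekLinks only run during fit. Below is a minimal sketch to verify this, using a synthetic toy dataset (the names X_toy, pipe, etc. are illustrative, not from the question):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import RobustScaler
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import TomekLinks

# Small imbalanced toy problem, just to exercise the pipeline.
X_toy, y_toy = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

pipe = make_pipeline(RobustScaler(), TomekLinks(), RandomForestClassifier(random_state=0))
pipe.fit(X_toy, y_toy)

# predict() scales the input with the fitted RobustScaler and skips TomekLinks,
# because samplers are only applied during fit.
pred_via_pipeline = pipe.predict(X_toy[:5])

# The same result by hand: transform with the scaler fitted inside the pipeline.
scaled = pipe.named_steps['robustscaler'].transform(X_toy[:5])
pred_by_hand = pipe.named_steps['randomforestclassifier'].predict(scaled)

print(np.array_equal(pred_via_pipeline, pred_by_hand))  # True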

  • why do you want to scale the data? – Nicolas Gervais Mar 10 '20 at 13:37
  • because the data consist of a combination of different features with widely varying scales. Also, some features are categorical. – Parvin Khorasani Mar 10 '20 at 13:40
  • While it's good that you remember to think about it, a decision-tree-based classifier like your RandomForestClassifier actually does not benefit from scaling. In the case of an SVM, for example, you would need to scale. So unfortunately this means the performance has nothing to do with scaling for this classifier. [Interesting read](https://www.quora.com/Decision-Tree-based-models-dont-require-scaling-How-does-scaling-impact-the-predictions-of-decision-tree-based-models) – Victor Sonck Mar 10 '20 at 13:42
  • you shouldn't rescale data for models that are based on trees – Nicolas Gervais Mar 10 '20 at 13:44
  • thank you @VictorSonck. So, altogether, is my procedure correct or not? And why are the results (F-measure) for the positive class not good (below 0.7)? – Parvin Khorasani Mar 10 '20 at 15:37
  • I think your pipeline is actually quite good and, to be honest, I don't know for sure whether the test set is scaled too. I'd think it is. But you can test it easily: try another model, like an SVC or logistic regression, instead of your random forest. If the results are comparable to the RF, the pipeline scales the test set and you can rule out scaling! If the SVC is a lot worse than the RF, scaling is indeed the issue. Let me know if you find anything! – Victor Sonck Mar 10 '20 at 16:00
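
Following up on the last comment, one way to run that experiment is to swap a scale-sensitive estimator into the same pipeline; note that the grid-search parameter prefixes change with the step name. A rough sketch reusing X_train, y_train, skfold and y_test from the question (the C grid and class_weight='balanced' are assumptions, not from the original post):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import RobustScaler
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import TomekLinks

# Same preprocessing, but a classifier that is sensitive to feature scale.
lr_pipeline = make_pipeline(RobustScaler(), TomekLinks(),
                            LogisticRegression(max_iter=1000, class_weight='balanced'))

# Grid-search parameters are prefixed with the lowercased step (class) name.
lr_params = {'logisticregression__C': [0.01, 0.1, 1, 10]}

gcv_lr = GridSearchCV(lr_pipeline, lr_params, cv=skfold, scoring='f1', n_jobs=12)
gcv_lr.fit(X_train, y_train)
print(classification_report(y_test, gcv_lr.predict(X_test)))

If this model's scores land in the same ballpark as the forest's, scaling of the test set is not the culprit.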
