4

Currently I am building a classifier with heavily imbalanced data. I am using the imblearn pipeline to first do StandardScaler, then SMOTE, and then the classification with GridSearchCV. This ensures that the upsampling is done during the cross-validation. Now I want to include feature selection in my pipeline. How should I include this step in the pipeline?

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score, classification_report, confusion_matrix

model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', RandomForestClassifier())
    ])

param_grid = {
    'classification__n_estimators': [10, 20, 50],
    'classification__max_depth': [2, 3, 5]
}

gridsearch_model = GridSearchCV(model, param_grid, cv=4, scoring=make_scorer(recall_score))
gridsearch_model.fit(X_train, y_train)
predictions = gridsearch_model.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
Joost Jansen
3 Answers

6

It does not necessarily make sense to include feature selection in a pipeline where your model is a random forest (RF). This is because the max_depth and max_features arguments of the RF model already control how many features are used when building the individual trees (a max_depth of n means each tree is grown to at most depth n, and each split considers at most max_features features). Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

You can simply investigate your trained model for the top-ranked features. When training an individual tree, it can be computed how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure. So you don't actually need to retrain the forest for different feature sets, because the feature importances (already computed in the sklearn model) tell you everything you need.
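As a minimal sketch (assuming the fitted gridsearch_model from the question and that X_train is a pandas DataFrame with named columns), you could pull the ranked importances out of the best pipeline like this:

import pandas as pd

# get the refit random forest out of the best pipeline found by the grid search
best_rf = gridsearch_model.best_estimator_.named_steps['classification']

# pair each importance with its feature name and rank them
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))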


P.S. I would not waste time grid searching n_estimators either, because more trees generally give better accuracy. More trees do mean more computational cost, and beyond a certain number of trees the improvement becomes negligible, so that may be worth keeping in mind, but otherwise you will gain performance from a large-ish n_estimators and you are not really at risk of overfitting either.

0

Do you mean feature selection from sklearn? https://scikit-learn.org/stable/modules/feature_selection.html

You can run it at the beginning. You will basically adjust the columns of X (X_train and X_test accordingly). It is important that you fit your feature selection only on the training data (as your test data should be unseen at that point in time).

How should I include this step into the pipeline?

So you should run it before your code.
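As a rough sketch of what "before your code" could look like (SelectKBest with k=10 is just an illustrative choice, not something prescribed by this answer), fitted on the training data only:

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)       # illustrative selector and k
X_train_sel = selector.fit_transform(X_train, y_train)   # fit on training data only
X_test_sel = selector.transform(X_test)                   # test data stays unseen during fitting

# then run the pipeline / grid search from the question on the reduced features
gridsearch_model.fit(X_train_sel, y_train)
predictions = gridsearch_model.predict(X_test_sel)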

PV8
0

There is no single "how" as if there were a concrete recipe; it depends on your goal.

If you want to check which set of features gives you the best performance (according to your metric, here recall), you could use sklearn's sklearn.feature_selection.RFE (Recursive Feature Elimination) or its cross-validation variant sklearn.feature_selection.RFECV.

The first one fits your model on the whole set of features, measures their importance, and prunes the least impactful ones. This is repeated until the desired number of features is left. It is quite computationally intensive though.

The second one also starts with all features and removes step features at a time, but scores each candidate feature subset with cross-validation. This continues until min_features_to_select is reached. It is VERY computationally intensive, way more than the first one.

As this operation is rather infeasible to combine with a hyperparameter search, you should do it with a fixed set of defaults before GridSearchCV, or after you have found suitable values with it. In the first case the feature choice will not depend on the hyperparameters you find, while in the second case the influence might be quite high. Both ways are correct but will probably yield different results and models.
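A minimal sketch of the first variant (RFECV with default RF hyperparameters, run before the grid search; step=1, cv=4 and min_features_to_select=5 are illustrative values, and X_train, y_train, X_test are assumed from the question):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import make_scorer, recall_score

# recursive feature elimination with cross-validation, using a default RF
selector = RFECV(RandomForestClassifier(), step=1, cv=4,
                 scoring=make_scorer(recall_score),
                 min_features_to_select=5)
selector.fit(X_train, y_train)

# reduce both sets to the selected features, then run GridSearchCV as before
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
gridsearch_model.fit(X_train_sel, y_train)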

You can read more about RFECV and RFE in this StackOverflow answer.

Szymon Maszke