
I'm wondering if it is possible for scikit-learn's RFECV to select a fixed number of the most important features. For example, working on a dataset with 617 features, I have been trying to use RFECV to see which 5 of those features are the most significant. However, RFECV does not have the parameter 'n_features_to_select', unlike RFE (which confuses me). How should I deal with this?

  • In addition to the answer below, [look at this example](http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py) which illustrates how RFECV works. As the answer suggests, RFECV tunes the number of features itself, so you don't need to provide the number of features to select. – Vivek Kumar Jul 05 '18 at 05:50
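As the comment notes, RFECV chooses the number of features on its own. For context, a minimal sketch of plain RFECV usage; the linear SVC and the synthetic data here are placeholders, not part of the original question:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

selector = RFECV(SVC(kernel="linear"), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)  # number of features RFECV judged optimal
print(selector.support_)     # boolean mask over the original columns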

2 Answers


According to this Quora post:

The RFECV object helps to tune or find this n_features parameter using cross-validation. For every step where "step" number of features are eliminated, it calculates the score on the validation data. The number of features left at the step which gives the maximum score on the validation data is considered to be "the best n_features" of your data.

This says that RFECV determines the optimal number of features (n_features) to get the best result.
The fitted RFECV object contains a ranking_ attribute with the feature ranking, and a support_ mask to select the optimal features found.
However, if you MUST select the top n features from RFECV, you can use the ranking_ attribute:

optimal_features = X[:, selector.support_] # selector is a RFECV fitted object

n = 6 # to select top 6 features
feature_ranks = selector.ranking_  # selector is a RFECV fitted object
feature_ranks_with_idx = enumerate(feature_ranks)
sorted_ranks_with_idx = sorted(feature_ranks_with_idx, key=lambda x: x[1])
top_n_idx = [idx for idx, rnk in sorted_ranks_with_idx[:n]]

top_n_features = X[:, top_n_idx]  # all rows, top-n ranked columns

Reference: sklearn documentation, Quora post

shanmuga

I know that this is an old question, but I think it is still relevant.

I don't think shanmuga's solution is right because features within the same rank are not ordered by importance. That is, if selector.ranking_ has 3 features with rank 1, I don't think it is necessarily true that the first in the list is more important than the second or third.

A naive solution to this problem would be to run RFE while setting n_features_to_select to the desired number and "manually" cross-validate it.
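For illustration, here is a sketch of that manual approach; the linear SVC, the synthetic data, and the choice of 5 features are placeholders. Wrapping RFE in a pipeline makes sure the selection is refit inside each cross-validation fold:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

# Fix the number of features up front, then cross-validate the whole pipeline
rfe = RFE(SVC(kernel="linear"), n_features_to_select=5)
pipeline = make_pipeline(rfe, SVC(kernel="linear"))
print(cross_val_score(pipeline, X, y, cv=5).mean())

rfe.fit(X, y)              # refit on all the data to get the final feature mask
top_5_mask = rfe.support_  # boolean mask of the 5 selected features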

If you want n features out of the optimal m features (with n < m), you can do:

# selector is a fitted RFECV object.
# Note: estimator_ was refit on the selected features only, so these
# importances (and the indices below) refer to columns of X[:, selector.support_].
feature_importance = selector.estimator_.feature_importances_  # or coef_
# Sort by importance in descending order so the most important features come first
feature_importance_sorted = sorted(enumerate(feature_importance), key=lambda x: x[1], reverse=True)
top_n_idx = [idx for idx, _ in feature_importance_sorted[:n]]

You should also note that multiple features may have the same importance or coefficient; ties may be broken arbitrarily, so this approach might leave out features that are just as important as the ones it keeps.
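As noted in the comments of the snippet above, top_n_idx indexes into the reduced matrix X[:, selector.support_], not the original X. A small sketch of mapping those positions back to original column indices (assuming NumPy; variable names follow the snippet above):

import numpy as np

selected_columns = np.where(selector.support_)[0]  # original column indices of the m kept features
top_n_original_idx = selected_columns[top_n_idx]   # original column indices of the top n
top_n_features = X[:, top_n_original_idx]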

Ricardo Mendes