8

I'm using scikit-learn's RFECV class to perform feature selection. I'm interested in identifying the relative importance of a bunch of variables. However, scikit-learn assigns the same rank (1) to multiple variables, as can also be seen in its example code:

>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_ 
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

Is there a way I can make scikit-learn also identify the relative importance between the top features?

I'm happy to increase the number of trees or similar if that's needed. Related to this, is there a way to see the confidence of this ranking?

pir

1 Answer

6

The goal of RFECV is to select the optimal number of features, so it cross-validates over the number of features to keep. In your case, it decided to keep 5 features. The model is then refit on the whole data set, eliminating features until only those 5 remain. Because those 5 are never eliminated, RFE never ranks them against each other, so they all share rank 1.
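For instance, you can inspect what the fitted selector settled on. A quick sketch, assuming scikit-learn >= 1.0 where the per-step CV scores live in cv_results_ (older versions expose grid_scores_ instead):

# Continuing from the fitted `selector` in the question
selector.n_features_                      # 5: how many features RFECV kept
selector.cv_results_["mean_test_score"]   # mean CV score per number of features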

You could get a ranking for all features by running RFE down to a single feature:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# Eliminate all the way down to one feature so every feature gets a distinct rank
selector = RFE(estimator, step=1, n_features_to_select=1)
selector = selector.fit(X, y)
selector.ranking_

array([ 4, 3, 5, 1, 2, 10, 8, 7, 6, 9])

You might ask why the rankings computed during cross-validation are not kept, since each of those runs did rank all features. However, the features might be ranked differently on each cross-validation split, so RFECV would have to return 5 different rankings for you to compare. That's not the interface, though (it is also easy to do yourself with RFE and a manual CV loop, as sketched below).
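For instance, a minimal sketch of that manual loop; KFold with 5 splits is an assumption mirroring cv=5 above, and the spread of each feature's rank across folds gives a rough sense of the confidence asked about in the question:

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

rankings = []
for train_idx, _ in KFold(n_splits=5).split(X):
    # Rank every feature using only this fold's training data
    rfe = RFE(SVR(kernel="linear"), step=1, n_features_to_select=1)
    rfe.fit(X[train_idx], y[train_idx])
    rankings.append(rfe.ranking_)

rankings = np.array(rankings)   # shape (5, 10): one full ranking per fold
rankings.std(axis=0)            # per-feature spread across folds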

On a different note, recursive elimination might not be the best way to measure the influence of the features; looking at the coefficients directly, or using permutation importance, might be more informative.
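For example, a sketch using sklearn.inspection.permutation_importance (available since scikit-learn 0.22); n_repeats=10 is an assumption, and scoring on the training data is only for brevity (a held-out set is preferable):

from sklearn.datasets import make_friedman1
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
model = SVR(kernel="linear").fit(X, y)

# Shuffle each feature in turn and measure the drop in score; the mean
# drop is the importance, the std gives a rough confidence for the ranking
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
result.importances_mean
result.importances_std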

Andreas Mueller
  • I see, thanks. I'm always a bit skeptical of using feature importances (especially when reading e.g. https://medium.com/turo-engineering/how-not-to-use-random-forest-265a19a68576). At least with CV and a metric, you know exactly what you're measuring :) – pir Jun 07 '19 at 21:05
  • How come there's no sklearn method that simply iterates over all the features, and removes them one by one by examining CV performance? That's the definition of backward feature selection I know from textbooks, and it should produce a robust ranking across all features. That's what I expected `RFECV` to be doing. – pir Jun 07 '19 at 21:08
  • Backward feature selection is not the same as recursive feature elimination. RFE is cheaper: it uses the feature importances / coefficients. There's a PR for backward feature selection, or you can use mlxtend (see the sketch below). How come: no one had the time or prioritized implementing and reviewing it. – Andreas Mueller Jun 09 '19 at 02:28
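For reference, backward feature selection later landed in scikit-learn 0.24 as SequentialFeatureSelector. A minimal sketch (n_features_to_select=5 is an assumption; note it returns a support mask, not a full ranking):

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Greedy backward selection: repeatedly drop the feature whose removal
# hurts the cross-validated score the least (no importances needed)
sfs = SequentialFeatureSelector(
    SVR(kernel="linear"), n_features_to_select=5, direction="backward", cv=5
)
sfs.fit(X, y)
sfs.get_support()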