8

I'm using scikit-learn's RFECV class to perform feature selection. I'm interested in identifying the relative importance of a bunch of variables. However, scikit-learn assigns the same rank (1) to multiple variables, as can also be seen in its example code:

>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_ 
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

Is there a way I can make scikit-learn also identify the relative importance between the top features?

I'm happy to increase the number of trees or similar if that's needed. Related to this, is there a way to see the confidence of this ranking?

pir

1 Answer

6

The goal of RFECV is to select the optimal number of features, so it cross-validates over the number of features to keep. In your case, it decided to keep 5 features. The model is then refit on the whole data set, eliminating features until only those 5 remain. Because those 5 are never eliminated, RFE never ranks them against each other, so they all share rank 1.
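For instance, you can inspect what the fitted selector settled on. A quick sketch, assuming scikit-learn >= 1.0 where the per-step CV scores live in cv_results_ (older versions expose grid_scores_ instead):

# Continuing from the fitted `selector` in the question
selector.n_features_                      # 5: how many features RFECV kept
selector.cv_results_["mean_test_score"]   # mean CV score per number of features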

You could get a ranking for all features by running RFE down to a single feature:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# Eliminate all the way down to one feature so every feature gets a distinct rank
selector = RFE(estimator, step=1, n_features_to_select=1)
selector = selector.fit(X, y)
selector.ranking_

array([ 4, 3, 5, 1, 2, 10, 8, 7, 6, 9])

You might ask why the rankings computed during cross-validation are not kept, since each of those runs did rank all features. However, the features might be ranked differently on each cross-validation split, so RFECV would have to return 5 different rankings for you to compare. That's not the interface, though (it is also easy to do yourself with RFE and a manual CV loop, as sketched below).
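For instance, a minimal sketch of that manual loop; KFold with 5 splits is an assumption mirroring cv=5 above, and the spread of each feature's rank across folds gives a rough sense of the confidence asked about in the question:

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

rankings = []
for train_idx, _ in KFold(n_splits=5).split(X):
    # Rank every feature using only this fold's training data
    rfe = RFE(SVR(kernel="linear"), step=1, n_features_to_select=1)
    rfe.fit(X[train_idx], y[train_idx])
    rankings.append(rfe.ranking_)

rankings = np.array(rankings)   # shape (5, 10): one full ranking per fold
rankings.std(axis=0)            # per-feature spread across folds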

On a different note, recursive elimination might not be the best way to measure the influence of the features; looking at the coefficients directly, or using permutation importance, might be more informative.
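For example, a sketch using sklearn.inspection.permutation_importance (available since scikit-learn 0.22); n_repeats=10 is an assumption, and scoring on the training data is only for brevity (a held-out set is preferable):

from sklearn.datasets import make_friedman1
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
model = SVR(kernel="linear").fit(X, y)

# Shuffle each feature in turn and measure the drop in score; the mean
# drop is the importance, the std gives a rough confidence for the ranking
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
result.importances_mean
result.importances_std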

Andreas Mueller
  • I see, thanks. I'm always a bit skeptical of using feature importances (especially when reading e.g. https://medium.com/turo-engineering/how-not-to-use-random-forest-265a19a68576). At least with CV and a metric, you know exactly what you're measuring :) – pir Jun 07 '19 at 21:05
  • How come there's no sklearn method that simply iterates over all the features, and removes them one by one by examining CV performance? That's the definition of backward feature selection I know from textbooks, and it should produce a robust ranking across all features. That's what I expected `RFECV` to be doing. – pir Jun 07 '19 at 21:08
  • Backward feature selection is not the same as recursive feature elimination. RFE is cheaper: it uses the feature importances / coefficients. There's a PR for backward feature selection, or you can use mlxtend (see the sketch below). How come: no one had the time or prioritized implementing and reviewing it. – Andreas Mueller Jun 09 '19 at 02:28
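For reference, backward feature selection later landed in scikit-learn 0.24 as SequentialFeatureSelector. A minimal sketch (n_features_to_select=5 is an assumption; note it returns a support mask, not a full ranking):

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Greedy backward selection: repeatedly drop the feature whose removal
# hurts the cross-validated score the least (no importances needed)
sfs = SequentialFeatureSelector(
    SVR(kernel="linear"), n_features_to_select=5, direction="backward", cv=5
)
sfs.fit(X, y)
sfs.get_support()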