How to use sklearn RFECV to select the optimal features to pass to a dimensionality reduction step before fitting my estimator

Question

How can I use sklearn's RFECV to select the optimal features to pass to LinearDiscriminantAnalysis(n_components=2) for dimensionality reduction, before fitting a KNN estimator?

pipeline = make_pipeline(Normalizer(), LinearDiscriminantAnalysis(n_components=2), KNeighborsClassifier(n_neighbors=10))

X = self.dataset
y = self.postures

min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(pipeline, step=1, cv=None, scoring='f1_weighted', min_features_to_select=min_features_to_select)

rfecv.fit(X, y)

print(rfecv.support_)
print(rfecv.ranking_)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features vs. cross-validation scores
# (grid_scores_ was removed in scikit-learn 1.2; newer versions use rfecv.cv_results_["mean_test_score"])
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validation score (weighted F1)")
plt.plot(range(min_features_to_select,
               len(rfecv.grid_scores_) + min_features_to_select),
         rfecv.grid_scores_)
plt.show()

I get the following error from this code. If I run it without the LinearDiscriminantAnalysis() step it works, but that step is an important part of my processing.

*** ValueError: when `importance_getter=='auto'`, the underlying estimator Pipeline should have `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to feature selector or call fit before calling transform.

afsharov · Accepted Answer · 2021-05-19T13:53:02.313

Your approach has a fundamental problem: KNeighborsClassifier has no intrinsic measure of feature importance, so it is not compatible with RFECV. The RFECV documentation describes the estimator it requires as:

A supervised learning estimator with a fit method that provides information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.

RFECV will therefore fail with KNeighborsClassifier. You need another classifier, such as RandomForestClassifier or SVC with a linear kernel (SVC only exposes coef_ when kernel='linear').
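As a rough check of which classifiers qualify, the sketch below fits a few candidates and tests for the two attributes RFECV looks for. The synthetic data from make_classification and the particular candidate settings are assumptions for illustration, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: the asker's dataset is not available).
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=10),
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
    "svc_linear": SVC(kernel="linear"),
    "svc_rbf": SVC(kernel="rbf"),
}

# An estimator is usable with the default importance_getter='auto' only if,
# once fitted, it exposes coef_ or feature_importances_.
usable = {}
for name, estimator in candidates.items():
    estimator.fit(X, y)
    usable[name] = hasattr(estimator, "coef_") or hasattr(estimator, "feature_importances_")
    print(name, usable[name])
```

With these candidates, only the random forest and the linear-kernel SVC should report True.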

If you can choose another classifier, your pipeline still needs to expose the feature importance of its final estimator. For this you can refer to this answer, which defines a custom pipeline for the purpose:

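from sklearn.pipeline import Pipeline

class MyPipeline(Pipeline):
    # Forward the final estimator's importance attributes so RFECV can find them.
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_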

Define your pipeline like:

pipeline = MyPipeline([
    ('normalizer', Normalizer()),
    ('lda', LinearDiscriminantAnalysis(n_components=2)),
    ('rf', RandomForestClassifier())
])

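and the ValueError about the missing coef_ or feature_importances_ attribute should go away. Be aware that the attribute exposed this way belongs to the final estimator, which only sees the two LDA components: its feature_importances_ has two entries rather than one per original column, so verify that the resulting ranking is meaningful before relying on it. LinearDiscriminantAnalysis(n_components=2) also needs at least two input features and three classes, so min_features_to_select should be at least 2.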
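An alternative arrangement, not taken from the answer above, is to run RFECV as a pipeline step ahead of the LDA, driven by an estimator whose importances have one entry per input column. The sketch below assumes synthetic three-class data from make_classification in place of the asker's dataset; the RandomForestClassifier settings, cv=3, and the step names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Synthetic stand-in data: three classes, so LDA can keep two components.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# RFECV ranks the original columns using the forest's feature_importances_.
# min_features_to_select=2 leaves enough columns for n_components=2.
selector = RFECV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    step=1, cv=3, scoring="f1_weighted", min_features_to_select=2,
)

pipeline = Pipeline([
    ("normalizer", Normalizer()),
    ("select", selector),
    ("lda", LinearDiscriminantAnalysis(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=10)),
])
pipeline.fit(X, y)

n_selected = pipeline.named_steps["select"].n_features_
reduced = pipeline[:-1].transform(X)  # output of normalizer -> select -> lda
print("Selected features:", n_selected)
print("Shape passed to KNN:", reduced.shape)
```

In this layout the features are ranked by the random forest's cross-validated score, not by the LDA and KNN steps, so "optimal" is relative to that helper model.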

edited May 19 '21 at 13:53

answered May 19 '21 at 12:07

afsharov

Sorry, I don't think I was clear enough. I don't know how to get RFECV to work using the pipeline in my code. I keep getting the ValueError displayed above. However, if I don't include the LinearDiscriminantAnalysis() step it does work. – Ben-Jamin-Griff May 19 '21 at 13:19
But then your approach is all for naught, since `KNeighborsClassifier` does not have an intrinsic feature importance measure. You need a classifier that can compute such a measure in order to use `RFECV`. – afsharov May 19 '21 at 13:46
You may have a look at my updated answer. – afsharov May 19 '21 at 13:54
Thank you, that makes sense now. – Ben-Jamin-Griff May 19 '21 at 14:08
