
I am using RandomForestClassifier() with 10-fold cross-validation as follows.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring='accuracy')
print(accuracy.mean())

I want to identify the important features in my feature space. It seems straightforward to get the feature importances for a single fitted classifier, as follows.

print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)

However, I could not find how to get feature importances when using cross-validation in sklearn.

In summary, I want to identify the most effective features (e.g., by using an average importance score) across the 10 folds of cross-validation.

I am happy to provide more details if needed.

– EmJ
  • You get feature importance for a single fitted classifier. If you do cross-validation you get multiple classifiers (10 in your case). Are you looking for the feature importance for each individual classifier or for all of them together? – MB-F Apr 02 '19 at 07:28
  • @kazemakase Thanks a lot for the comment. I am looking for the feature importance of all of them together :) – EmJ Apr 02 '19 at 07:56
  • In that case you don't really need cross-validation. You can just fit a classifier on the whole data set and take the feature importance from that. – MB-F Apr 02 '19 at 08:05
  • @MB-F in some cases you want to learn the importance in each fold separately, because the folds consist of different population balances. – Helen Jun 23 '22 at 02:40

1 Answer


cross_val_score() does not return the estimators fitted for each train-test split.

You need to use cross_validate() and set return_estimator=True.

Here is a working example:

from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
output = cross_validate(clf, X, y, cv=2, scoring='accuracy', return_estimator=True)

# output['estimator'] holds the fitted estimator for each fold
for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_importances = pd.DataFrame(estimator.feature_importances_,
                                       index=diabetes.feature_names,
                                       columns=['importance']).sort_values('importance', ascending=False)
    print(feature_importances)

Output:

Features sorted by their score for estimator 0:
     importance
s6     0.137735
age    0.130152
s5     0.114561
s2     0.113683
s3     0.112952
bmi    0.111057
bp     0.108682
s1     0.090763
s4     0.056805
sex    0.023609
Features sorted by their score for estimator 1:
     importance
age    0.129671
bmi    0.125706
s2     0.125304
s1     0.113903
bp     0.111979
s6     0.110505
s5     0.106099
s3     0.098392
s4     0.054542
sex    0.023900
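
To get the single averaged ranking the question asks for (averaging across folds is also discussed in the comments below), a minimal sketch that reuses `output` and `diabetes` from the example above:

import numpy as np

# Average the per-fold importance vectors element-wise
mean_importances = pd.DataFrame(
    np.mean([est.feature_importances_ for est in output['estimator']], axis=0),
    index=diabetes.feature_names,
    columns=['importance']).sort_values('importance', ascending=False)
print(mean_importances)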
– Venkatachalam
  • Thanks a lot for the great answer. However, I am still not clear what the difference is between `cross_val_score()` and `cross_validate()`. Can we use `cross_validate()` to get accuracy, precision, recall and f-measure? :) – EmJ Apr 02 '19 at 08:38
  • Yes, you can get those values as well using `cross_validate()`, based on the value you set for scoring. Actually, `cross_val_score` internally calls `cross_validate`. Hence, if you want more functionality, go for `cross_validate`. – Venkatachalam Apr 02 '19 at 08:41
  • @ai_learning This is really helpful, thank you. I am trying to use a pipeline because within each fold I also want to do feature selection and normalization. However, I then get the error message `AttributeError: 'Pipeline' object has no attribute 'feature_importances_'`. Do you know how I could get around this? – firefly Jun 09 '19 at 16:25
  • Looks like an interesting question; could you please add more details and ask this as a separate question? – Venkatachalam Jun 10 '19 at 06:37
  • @ai_learning great, thank you, have posted here: https://stackoverflow.com/questions/56562208/how-to-extract-important-features-after-k-fold-cross-validation-with-or-without. (sorry for the slow reply, missed the notification of your comment). – firefly Jun 13 '19 at 12:43
  • @Venkatachalam Hi. I am wondering if you could answer this question for me related to this topic. If I want to plot the feature importances of a model put through cross-validation, is it okay to take the average of the feature importances across all folds for each feature? Just like how we do with the score and standard deviation. – callmeanythingyouwant Sep 16 '20 at 04:38
  • Yes, that sounds like a reasonable thing to do. If you have specifics to add to it, ask it as a new question. – Venkatachalam Sep 16 '20 at 08:11
  • Thanks so much. What if you used a pipeline to transform categorical features? So you have `output = cross_validate(pipe, X, y, cv=2, scoring='accuracy', return_estimator=True)`. – pmanDS Jan 18 '22 at 08:45
  • You could use the pipeline's [get_feature_names_out](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.get_feature_names_out) method for it. – Venkatachalam Jan 18 '22 at 10:13
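
Following up on the pipeline comments above: `feature_importances_` lives on the final step of a fitted `Pipeline`, not on the `Pipeline` object itself, and `cross_validate()` also accepts a list of scorers. Here is a minimal sketch of both points; the step names 'scaler' and 'rf' are illustrative assumptions, and X, y are reused from the example above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline; 'scaler' and 'rf' are illustrative step names
pipe = Pipeline([('scaler', StandardScaler()),
                 ('rf', RandomForestClassifier(n_estimators=10, random_state=42))])

# Passing a list of scorers yields output['test_accuracy'],
# output['test_f1_macro'], etc. in the result dict
output = cross_validate(pipe, X, y, cv=2,
                        scoring=['accuracy', 'f1_macro'],
                        return_estimator=True)

for idx, estimator in enumerate(output['estimator']):
    # Each estimator is a fitted Pipeline; reach the forest via its step name
    rf = estimator.named_steps['rf']  # equivalently: estimator[-1]
    print("Importances for estimator {}: {}".format(idx, rf.feature_importances_))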