Using cross-validation to calculate feature importance "Some Questions"

Question

I am currently working on a project. I already selected my features and want to check their importance. I have some questions if anyone can help me please.

1- Does it make sense if I use RandomForestClassifier with cross-validation to calculate the feature importance?

2- I tried to calculate the feature importance using the cross_validate function https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html . The function provides the test_score and train_score results. The results I got with 10-fold cross-validation were as follows:

test_score [0.99950158, 0.9997231 , 0.9997231 , 0.99994462, 0.99977848, 0.99983386, 0.99977848, 0.9997231 , 0.99977847, 1.]

train_score [0.99998769, 0.99998154, 0.99997539, 0.99997539, 0.99998154, 0.99997539, 0.99998154, 0.99997539, 0.99998154, 0.99997539]

Can anyone explain these results? What do they indicate?

3- The cross_validate function has a parameter called scoring, which accepts different values such as accuracy, balanced_accuracy and f1. What does the scoring parameter do? What do these values mean, and how should I decide which one to choose? I already read the scikit-learn documentation, but it wasn't clear to me.

Thank you.

score 0 · Answer 1 · answered Dec 03 '19 at 12:29

Your question 1 is a bit open-ended. Each run (fold) of cross-validation fits a separate model, so you get one array of feature importances per fold. How would you then combine those into a single importance per feature? If a feature scores highly in every fold, that suggests it is genuinely important, but the values can vary from fold to fold.
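One common way to combine the per-fold importances is to average them and look at the spread across folds. The following is a minimal sketch of that idea, not the only valid approach. It assumes a synthetic dataset from make_classification as a stand-in for the real data, and it uses the return_estimator=True option of cross_validate to get the fitted model of each fold back.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# return_estimator=True keeps the model fitted on each training fold.
cv_results = cross_validate(clf, X, y, cv=5, return_estimator=True)

# One row of importances per fold, one column per feature.
importances = np.array(
    [est.feature_importances_ for est in cv_results["estimator"]]
)

# Average across folds; the standard deviation shows how stable each value is.
mean_importance = importances.mean(axis=0)
std_importance = importances.std(axis=0)

for i, (m, s) in enumerate(zip(mean_importance, std_importance)):
    print(f"feature {i}: {m:.3f} +/- {s:.3f}")
```

A feature whose standard deviation is large compared with its mean importance is one whose ranking depends on the split, so it should be interpreted with care.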

Now, cross_validate returns the default score of the estimator passed to it unless the scoring parameter is set. So if you leave scoring unset, it uses RandomForestClassifier's score() method, which returns accuracy.

(In scikit-learn, score() returns accuracy for all classifiers and the R-squared value for all regressors.)
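As a quick check of that default, the sketch below compares score() against accuracy_score on a held-out split. The dataset is synthetic and the split is arbitrary; both are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# The classifier's default score ...
default_score = clf.score(X_test, y_test)

# ... matches the accuracy computed explicitly from its predictions.
accuracy = accuracy_score(y_test, clf.predict(X_test))

print(default_score, accuracy)
```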

So for your question 2: the returned scores are accuracies per cv fold.

If you do not want to use accuracy and want some other score, you can set the scoring parameter in cross_validate.
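For example, the metrics mentioned in the question can be requested together by passing a list of scorer names. This is a minimal sketch assuming a synthetic, deliberately imbalanced binary dataset, which is the kind of case where accuracy, balanced_accuracy and f1 tend to disagree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data: roughly 90% of samples in class 0 (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Several metrics at once; results are keyed as "test_<name>" and "train_<name>".
scoring = ["accuracy", "balanced_accuracy", "f1"]
cv_results = cross_validate(
    clf, X, y, cv=5, scoring=scoring, return_train_score=True
)

for name in scoring:
    test_scores = cv_results[f"test_{name}"]
    print(f"{name}: mean test score {test_scores.mean():.3f}")
```

On imbalanced data, accuracy can look high simply because the majority class dominates, so balanced_accuracy or f1 is often the more informative choice there.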
