
I know a decision tree has a `feature_importances_` attribute, computed from Gini impurity, which can be used to check which features are more important.

However, implementations in scikit-learn or Spark only accept numeric attributes, so I have to convert string attributes to numeric ones and then apply a one-hot encoder. When the features are fed into the decision tree model, they are 0-1 encoded rather than in their original format. My question is: how do I explain feature importance for the original attributes? Should I avoid one-hot encoding when trying to explain feature importance?

Thanks.

linpingta
  • You could try to estimate the feature importance of an original feature as the sum of the feature importances of the corresponding features after OHE. To do this you will have to understand which OHE columns were created by which particular feature. – Ibraim Ganiev Oct 15 '16 at 07:40
  • @IbraimGaniev thanks for your help :) However, for OHE, it's difficult to know how many 0-1 variables belong to each feature... I am not sure whether that's a standard way to do it... – linpingta Oct 15 '16 at 12:36
  • Well, OHE stores a feature_indices_ attribute, from which you can tell exactly which categorical features were decomposed into which binary features. – Ibraim Ganiev Oct 15 '16 at 15:33
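The summing approach from the comments can be sketched as follows. This is a minimal illustration, not the asker's code: the group boundaries and importance values are made up, and in older scikit-learn versions the boundaries would come from the encoder's `feature_indices_` attribute (newer versions expose `categories_` instead).

```python
# Aggregate per-dummy importances back to the original categorical
# features, given the column ranges each original feature expanded into.
import numpy as np

# Hypothetical example: 3 original features expanded to 9 OHE columns.
# Columns boundaries[i]:boundaries[i+1] belong to original feature i
# (this is the layout feature_indices_ described in old scikit-learn).
boundaries = [0, 3, 5, 9]
importances = np.array([0.10, 0.05, 0.05, 0.30, 0.10, 0.10, 0.10, 0.10, 0.10])

# One summed importance per original feature.
grouped = [importances[boundaries[i]:boundaries[i + 1]].sum()
           for i in range(len(boundaries) - 1)]
print(grouped)
```

The summed values keep the usual property that the importances of all original features still add up to 1.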

1 Answer


Conceptually, you may want to use something along the lines of permutation importance. The basic idea is that you take your original dataset and randomly shuffle the values of each column, one at a time. Then you score the perturbed data with the model and compare the performance to the original performance. Done one column at a time, this tells you the performance hit you take by destroying each variable; you can then index the losses to the variable with the largest loss (which becomes 1, or 100%). If you do this on your original dataset, prior to the one-hot encoding, you get an importance measure that groups the encoded columns together.
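The idea above can be sketched as follows. This is an assumption-laden sketch, not the answerer's code: `model` stands for any fitted estimator with a `.score(X, y)` method, and `groups` is a hypothetical mapping from each original feature name to the list of encoded column indices it produced, so each group of OHE columns is shuffled together as one block.

```python
# Permutation importance that treats all OHE columns of one original
# feature as a single unit: shuffle the whole block of columns with
# the same row permutation, then measure the drop in score.
import numpy as np

def permutation_importance(model, X, y, groups, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = {}
    for name, cols in groups.items():
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(Xp))
            # Shuffle the group's columns as a block so the dummies
            # of one categorical feature stay consistent with each other.
            Xp[:, cols] = Xp[perm][:, cols]
            scores.append(model.score(Xp, y))
        drops[name] = baseline - np.mean(scores)
    # Index to the largest drop so the most important feature reads 1.0
    # (assumes at least one feature actually hurts the score when shuffled).
    top = max(drops.values())
    return {name: drop / top for name, drop in drops.items()}
```

A feature whose shuffled score equals the baseline gets importance 0; the feature with the largest score drop gets importance 1.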

Josh