4

I've been searching the internet for a while now to understand the numeric 'ranking' statistic that rpart assigns to each variable in its variable importance output.

I understand that these numbers add up to 100, but what exactly is this statistic, what is it called, and what does it represent?

In the past I have found it quite useful for ranking many categorical columns against a continuous target variable.

Michał Perłakowski
Mak87

1 Answer

2

The importance is calculated separately for each variable, as the sum of the decrease in impurity over every split in which the variable takes part. It counts both the splits where the variable appears as the primary split and those where it appears as a surrogate; according to the rpart vignette, a surrogate is credited with the primary split's improvement multiplied by its adjusted agreement. In the printed summary the values are then scaled so that they sum to 100 and rounded, and variables whose share is below 1% are omitted. You can read a better description of what variable importance means here: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf and in the book by Breiman et al. (Classification and Regression Trees).
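To make the arithmetic concrete, here is a minimal sketch of the calculation as the vignette describes it. rpart itself is an R package; this only replays the bookkeeping in Python and is not rpart's code. The split records, the variable names (`Start`, `Age`, `Number`) and every number are made up for illustration, and the helper names `raw_importance` and `scale_to_100` are hypothetical.

```python
from collections import defaultdict


def raw_importance(splits):
    """Sum the improvement credited to each variable.

    Each split is a dict with:
      'primary'    - name of the variable used for the primary split
      'improve'    - decrease in impurity achieved by that split
      'surrogates' - list of (variable name, adjusted agreement) pairs
    """
    totals = defaultdict(float)
    for split in splits:
        # The primary variable gets the full improvement of the split.
        totals[split["primary"]] += split["improve"]
        # Each surrogate gets the same improvement, weighted by its
        # adjusted agreement with the primary split.
        for name, adj_agreement in split["surrogates"]:
            totals[name] += split["improve"] * adj_agreement
    return dict(totals)


def scale_to_100(raw):
    """Rescale the raw totals so that they sum to 100."""
    total = sum(raw.values())
    return {name: 100.0 * value / total for name, value in raw.items()}


# Hypothetical tree with three splits (illustrative numbers only).
splits = [
    {"primary": "Start", "improve": 8.0, "surrogates": [("Age", 0.5)]},
    {"primary": "Age", "improve": 2.0, "surrogates": []},
    {"primary": "Start", "improve": 4.0, "surrogates": [("Number", 0.5)]},
]

raw = raw_importance(splits)   # Start: 12.0, Age: 6.0, Number: 2.0
scaled = scale_to_100(raw)     # Start: 60.0, Age: 30.0, Number: 10.0
print(scaled)
```

Note that `Number` never appears as a primary split in this made-up tree, yet it still receives a share of the importance through its surrogate role. This is why a variable can show up in the importance output without appearing in the plotted tree.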

Hope this helps!

YDO