
I'm currently working on a project in which I use a random forest. I want to know the feature importance of all covariates and would like to use MeanDecreaseGini for this.

I really don't understand how there can be values greater than 0.5. For a two-class target, the Gini index can't be greater than 0.5, so the decrease shouldn't be either. If you average the decrease over all nodes in the forest where a specific covariate was used for the split, the mean decrease in Gini can't be greater than 0.5 either. Can anybody tell me where my reasoning goes wrong?
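To make the 0.5 bound concrete, here is a minimal sketch in base R. It assumes the two-class Gini impurity 2 * p * (1 - p), where p is the share of one class in a node; the helper name `gini_two_class` is only illustrative and is not part of the randomForest package.

```r
# Gini impurity of a node with two classes, where p is the share of class 1.
# (Illustrative helper, not a randomForest function.)
gini_two_class <- function(p) {
  2 * p * (1 - p)
}

# Evaluate the impurity on a grid of class shares from 0 to 1.
p <- (0:100) / 100
impurity <- gini_two_class(p)

max(impurity)            # largest impurity on the grid
p[which.max(impurity)]   # class share at which the maximum occurs

# Impurity of the root node for a 20 vs. 30 class split,
# as in the example data further down.
gini_two_class(20 / 50)
```

This only checks the per-node bound; it says nothing about how randomForest aggregates the decreases over nodes and trees.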

Here is example code in which the MeanDecreaseGini values are much greater than 0.5:

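# install.packages("randomForest")  # run once if the package is not installed
library(randomForest)

set.seed(1)
# Two-class target: 20 observations of class 1, 30 of class 0
a <- as.factor(c(rep(1, 20), rep(0, 30)))
# Two numeric covariates
b <- c(rnorm(20, 5, 2), rnorm(30, 4, 1))
c <- c(rnorm(25, 0, 1), rnorm(25, 1, 2))
data <- data.frame(a = a, b = b, c = c)

# Fit the forest and print the importance measures (including MeanDecreaseGini)
rf <- randomForest(a ~ b + c, data = data, importance = TRUE, ntree = 300)
importance(rf)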
gung - Reinstate Monica
TobiSonne
  • What makes you think the Gini index can't be greater than 0.5? – Dason Jun 28 '17 at 15:41
  • If the target has two classes and, at the beginning, n/2 observations belong to one class and n/2 to the other, the Gini index is 2 * (n/2)/n * (1 - (n/2)/n) = 2 * 0.5 * 0.5 = 0.5, the "worst" distribution. Isn't that correct? – TobiSonne Jun 28 '17 at 16:16
