
I used randomForest for a regression problem. I used importance(rf,type=1) to get the %IncMSE for the variables, and one of them has a negative %IncMSE. Does this mean that this variable is bad for the model? I searched the Internet for answers but didn't find a clear one. I also found something strange in the model's summary (attached below): it seems that only one tree was used, although I set ntree to 800.

model:

rf<-randomForest(var1~va2+var3+..+var35,data=d7depo,ntree=800,keep.forest=FALSE, importance=TRUE)

summary(rf)
                Length Class  Mode     
call                6  -none- call     
type                1  -none- character
predicted       26917  -none- numeric  
mse               800  -none- numeric  
rsq               800  -none- numeric  
oob.times       26917  -none- numeric  
importance         70  -none- numeric  
importanceSD       35  -none- numeric  
localImportance     0  -none- NULL     
proximity           0  -none- NULL     
ntree               1  -none- numeric  
mtry                1  -none- numeric  
forest              0  -none- NULL     
coefs               0  -none- NULL     
y               26917  -none- numeric  
test                0  -none- NULL     
inbag               0  -none- NULL     
terms               3  terms  call 
smci
mql4beginner

1 Answer


Question 1 - why does ntree show 1?:

summary(rf) shows you the length of each object stored in your rf variable. The 1 next to ntree means that rf$ntree is a vector of length 1, not that one tree was grown. If you type rf$ntree in your console you will see that it shows 800.
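The difference between an element's length and its value can be seen with a plain list. This is only a minimal sketch: `obj` is a hypothetical stand-in, not a real randomForest fit, but `summary()` prints the same kind of Length/Class/Mode table for it.

```r
# Stand-in for a fitted model object (hypothetical; NOT a real randomForest fit)
obj <- list(ntree = 800, mse = numeric(800))

s <- summary(obj)     # Length/Class/Mode table, one row per list element
print(s)

length(obj$ntree)     # how many values the element holds: a single number
obj$ntree             # the value itself
```

The table reports a length of 1 for `ntree` because the element stores one number; the number of trees is the value of that element.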

Question 2 - does a negative %IncMSE show a "bad" variable?

%IncMSE:
First the MSE of the whole model is computed. Let's call it MSEmod. Then, for each variable (column in your data set) in turn, the values are randomly shuffled (permuted), so that a "bad" version of the variable is created, and a new MSE is calculated. For example, imagine that one column holds the values 1,2,3,4,5 in rows one to five; after the permutation these might end up as 4,3,1,2,5. All of the other columns remain exactly the same, since we want to examine col1's importance alone. The MSE calculated with the permuted column is MSEcol1 (in the same manner you will have MSEcol2, MSEcol3, and so on, but let's keep it simple and deal only with MSEcol1 here).

Since the second MSE was computed with a variable whose values are completely random, we would expect MSEcol1 to be higher than MSEmod (the higher the MSE, the worse the model). Therefore, the difference MSEcol1 - MSEmod is usually positive. In your case, a negative number shows that the randomly shuffled variable worked better than the original one, which suggests that the variable is probably not predictive enough, i.e. not important.
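To see how a negative difference can arise, here is a small sketch in base R. It is an illustration under assumptions, not what randomForest does internally: the data are simulated, the model is a plain `lm` fit rather than a forest, and the names `d`, `noise`, `mse` and `diffs` are made up for the example. The `noise` column is unrelated to `y`, so shuffling it should change the MSE very little, and one would expect the sign of the change to depend largely on chance.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: x1 drives y, noise is unrelated to y
n <- 500
d <- data.frame(x1 = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$x1 + rnorm(n)

# Fit on one half, evaluate on the other
train <- d[1:250, ]
test  <- d[251:500, ]
fit   <- lm(y ~ x1 + noise, data = train)

mse <- function(data) mean((data$y - predict(fit, newdata = data))^2)
mse_mod <- mse(test)  # MSEmod: error with the original columns

# Permute the noise column 200 times and record MSEcol - MSEmod each time
diffs <- replicate(200, {
  shuffled <- test
  shuffled$noise <- sample(shuffled$noise)
  mse(shuffled) - mse_mod
})

summary(diffs)
mean(diffs < 0)  # share of permutations where shuffling appeared to "help"
```

A small negative value for a single variable is therefore best read as "no detectable contribution", not as evidence that the variable actively harms the model.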

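Keep in mind that this description is the high-level picture. In reality, according to the randomForest documentation, the error is measured on each tree's out-of-bag data, the difference is averaged over all trees, and the result is normalized by the standard deviation of the differences, so the reported figure is a scaled value rather than a raw difference. But the high-level story is this.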

In algorithm form:

  1. Compute model MSE
  2. For each variable in the model:
    • Permute variable
    • Calculate new model MSE according to variable permutation
    • Take the difference: new model MSE minus original model MSE
  3. Collect the results in a list
  4. Rank the variables by their %IncMSE. The greater the value, the more important the variable
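The steps above can be sketched in a few lines of base R. This is a simplified illustration, assuming simulated data and an `lm` fit evaluated on a held-out half, whereas randomForest works per tree on out-of-bag data and scales the result. All names here (`dat`, `signal`, `junk`, `model_mse`, `perm_importance`) are hypothetical.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: signal drives y, junk is pure noise
n   <- 500
dat <- data.frame(signal = rnorm(n), junk = rnorm(n))
dat$y <- 3 * dat$signal + rnorm(n)

train <- dat[1:250, ]
test  <- dat[251:500, ]
fit   <- lm(y ~ signal + junk, data = train)

model_mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Step 1: model MSE with the original columns
mse_mod <- model_mse(fit, test)

# Steps 2 and 3: permute one variable at a time and collect the differences
perm_importance <- sapply(c("signal", "junk"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])  # permute only this column
  model_mse(fit, shuffled) - mse_mod      # new MSE minus original MSE
})

# Step 4: rank the variables, largest increase in MSE first
ranking <- sort(perm_importance, decreasing = TRUE)
print(ranking)
```

In this setup `signal` should come out far ahead, while the value for `junk` should sit close to zero and may land on either side of it.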

Hope it is clear now!

LyzandeR
  • Thank you very much LyzandeR for your detailed and clear answer, Cheers, Ron – mql4beginner Jan 13 '15 at 12:27
  • Happy to have helped Ron :). If you want to dig deeper, you can have a look [here](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm). This is from Breiman himself (the inventor of random forests), and he explains how they work in plain English without (a lot of) mathematical formulas. This is the reference the randomForest package used for the implementation. – LyzandeR Jan 13 '15 at 12:32
  • @LyzandeR surely the value calculated (in the simple explanation) should be `MSEcol1 - MSEmod`, since if `MSEcol1 > MSEmod`, as is likely if variable 1 is of any use, then the difference would be positive (consequently `MSEmod - MSEcol1` in your answer would be negative)... – stas g Nov 25 '15 at 17:28
  • Thanks @stasg. You are right, I made a mistake there; it should be the other way round, as you say. Thanks for noticing this. This is what I like about the community: we can check each other's mistakes. Thanks again. – LyzandeR Nov 25 '15 at 17:34
  • @LyzandeR no problem :) – stas g Nov 25 '15 at 18:00