The R package `randomForest` reports mean squared errors for each tree in the forest. However, I need a measure of confidence for each case in the data. Since `randomForest` calculates the casewise predictions by averaging the predictions of the single trees, I guess that it should also be possible to calculate a casewise standard error and thus a confidence interval. Can this be done using the output `randomForest` object (if so, how?) or do I have to dig into the source code?
1 Answer
No need to dig into the source code. You only need to read the documentation. `?predict.randomForest` states that one of its arguments is called `predict.all`:

> `predict.all` Should the predictions of all trees be kept?

So setting that to `TRUE` will keep a prediction for each case, for each tree, which you can then use to calculate standard error for each case.
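As a minimal sketch of that idea (assuming a regression forest; the `mtcars` data, the seed, and the variable names are only illustrative), the per-tree predictions come back in the `individual` component, one row per case and one column per tree. The standard error below is the naive one for a mean and treats the trees as if they were independent, which they are not, so read it as a rough measure of spread rather than a rigorous interval:

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only for reproducibility of the illustration

# Regression forest on a built-in dataset (illustrative choice)
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500)

# predict.all = TRUE returns a list with
#   $aggregate  : the usual forest prediction per case
#   $individual : matrix of per-tree predictions (cases x trees)
# Note: predicting on the training data gives in-sample, not out-of-bag, predictions.
pred <- predict(rf, newdata = mtcars, predict.all = TRUE)

tree_preds <- pred$individual
case_mean  <- rowMeans(tree_preds)        # matches pred$aggregate for regression
case_sd    <- apply(tree_preds, 1, sd)    # spread of the tree predictions per case

# Naive standard error of the mean; assumes independent trees (an assumption
# that does not really hold for a random forest)
case_se <- case_sd / sqrt(ncol(tree_preds))

# Naive 95% interval around each casewise prediction
ci <- cbind(lower = case_mean - 1.96 * case_se,
            upper = case_mean + 1.96 * case_se)
head(ci)
```

Because the trees are fit to overlapping bootstrap samples, `case_sd` is often the more honest quantity to report; the interval in `ci` will usually be too narrow.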
I have recently been made aware of this paper by Stefan Wager, Trevor Hastie and Brad Efron which investigates more rigorously the idea of standard errors for the predictions generated by random forests (and other bagged predictors).

joran
-
Sorry for asking here, but just to be sure: here the randomForest type is predictions, otherwise we can't speak about a confidence interval, can we? – agstudy Feb 05 '13 at 15:32
-
@agstudy Not sure I follow. I will readily grant that the _statistical_ meaning of prediction intervals may very well be questionable here, but on some level the predictions are just averages, so calculating a "confidence interval" for each one in a naive way really does just amount to calculating the CI for a mean. Whether the resulting interval means anything useful is obviously a separate question... – joran Feb 05 '13 at 15:39
-
Thanks. My question is because randomForest can perform classification or regression (`object$type = 'predictions'`). So does calculating a CI in the case of classification have any statistical meaning? – agstudy Feb 05 '13 at 15:46
-
@agstudy Oh, I see. Yeah, this answer (and really, the question) only makes much sense if they are building a regression tree. If they're doing classification, this whole idea sort of breaks down. – joran Feb 05 '13 at 15:47
-
Thanks. I am asking for evidence because I am not a statistician. Otherwise, is the answer by @Eric [here](http://stats.stackexchange.com/questions/13869/compare-r-squared-from-two-different-random-forest-models) the beginning of an answer? – agstudy Feb 05 '13 at 16:05
-
@joran - I think I read the documentation a dozen times but I did not recognize that this is the option that I was looking for... – user7417 Feb 06 '13 at 12:51
-
@joran - If it is correct to calculate casewise predictions (in a regression context, of course) as averages of the single tree predictions, then IMHO it should also be meaningful to characterise the variation of the tree predictions around these averages. – user7417 Feb 06 '13 at 13:02
-
@joran - Browsing through Stack Overflow, it became clear to me that in fact I'm looking for *prediction intervals*, something that expresses uncertainty surrounding the predicted y of a single case (while a confidence interval expresses uncertainty about the expected value of y). The predict method for randomForest does not allow anything in that direction: no options such as `interval="prediction"` or `interval="confidence"` as for linear models. Any clue how prediction intervals could be calculated for randomForest predictions? – user7417 Feb 06 '13 at 15:41