I feel like I'm missing something very basic here.

I've run a random forest regression:

INTERP.rf <- randomForest(y ~ ., data = df, importance = TRUE, mtry = 3, ntree = 300)

and then extracted the predictions for the training set:

rf.predict <- predict(INTERP.rf, df, type = "response")

The % Var explained reported for the model looked too low, so I checked the MSE by hand:

MSE.rf <- mean((rf.predict - df$y)^2)

...and got a wildly different answer from the one shown when I inspect the fitted model (printing INTERP.rf).

Please can someone highlight my error?

Lisa Avery
  • You are predicting on the data used to build the model. That is bad and is generally never done (overfitting). By default, `randomForest` reports the out-of-bag (OOB) errors. – joran Jun 12 '15 at 15:35
  • @joran - I agree that evaluating the model on the data used to build the model can result in overfitting. But it's not at all a bad idea to verify the output generated by using `predict` on a `randomForest` object. – davechilders Jun 12 '15 at 15:42
  • @DMC You're right, I wrote that comment a little fast. It's "bad" with respect to measuring predictive accuracy. – joran Jun 12 '15 at 15:43
  • I would just like to add that the above comments are being very careless and not very helpful with the language that they're using. It's most important, in any statistical analysis, that you know what you are asking for, what you are receiving, and the implications of both when you are running any function. I think that the point is clear that the OP was misunderstanding how `randomForest()` predictions work, both OOB and for "new" or original data. It would be more helpful to link to docs and examples explaining how functions work rather than saying something is "garbage." – Forrest R. Stevens Jun 12 '15 at 16:31
  • Thank you @Vlo! I was not aware of this distinction and this has solved my problem. – Lisa Avery Jun 12 '15 at 18:10

1 Answer

The correct way to do this is to use:

rf.predict<-predict(INTERP.rf)

I was not aware that I needed to call `predict(model)`, which dispatches to `predict.randomForest` with no `newdata`, rather than `predict(model, trainingData)`, to get the OOB predictions.
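To make the difference concrete, here is a minimal sketch comparing the two calls. It assumes the `randomForest` package is installed and uses the built-in `mtcars` data with `mpg` as the response purely as a stand-in for my `df` and `y`; the seed and the object names (`fit`, `oob.pred`, `train.pred`) are arbitrary choices for illustration.

```r
# Sketch only: mtcars / mpg are stand-ins for the original df / y.
library(randomForest)

set.seed(42)  # arbitrary seed so the run is repeatable
fit <- randomForest(mpg ~ ., data = mtcars, mtry = 3, ntree = 300)

# No newdata: each row is predicted only by trees that did not see it (OOB).
oob.pred <- predict(fit)

# With newdata: every tree contributes, including those trained on that row.
train.pred <- predict(fit, mtcars)

mse.oob   <- mean((oob.pred - mtcars$mpg)^2)
mse.train <- mean((train.pred - mtcars$mpg)^2)

# The OOB figure should agree with the last value tracked in fit$mse,
# while the in-sample figure is expected to be noticeably smaller.
print(c(oob = mse.oob, reported = fit$mse[fit$ntree], in.sample = mse.train))
```

The in-sample MSE is optimistic because each tree has already seen most of the rows it is asked to predict, so it will generally not match the MSE and % Var explained printed for the fitted model.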

Thank you to @joran and @Vlo for the helpful comments.

Lisa Avery