
I am using the randomForest package in R for regression. It reports two pieces of information: "Mean of squared residuals" and "% Var explained". However, I want to calculate the RMSE and R^2 for both the training and test sets. How can I get these values?

Hamid Pourjam
Farhaneh Moradi
  • Please provide a minimally reproducible example of your code with library dependencies and any functions you used. – mlegge Apr 21 '15 at 14:48

1 Answer


Sorry this is not a specific answer, but I do not have enough cred to leave a comment.

It is hard to say exactly how to get what you want without a reproducible example. However, if you passed the xtest= and ytest= arguments in the call to randomForest (assuming you are using the "randomForest" package), then what you are looking for should already be part of the resulting randomForest object, in the test component of the list it returns.

An attempted example:

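library(randomForest)
rf.results <- randomForest(x = x.train, y = y.train, xtest = x.test, ytest = y.test)  # x.train etc. are placeholders for your own data
rf.results$test$mse        # test-set MSE
sqrt(rf.results$test$mse)  # RMSE is the square root of the MSE
rf.results$test$rsq        # pseudo-R2 on the test set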

If you have the randomForest package loaded, you can verify this and explore further with ?randomForest. The "Value" section of the documentation describes the object returned by a call to randomForest and where to find the various performance metrics.
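
If xtest= and ytest= were not supplied when the forest was fitted, the same quantities can be computed by hand from predictions. The following is only a sketch under assumptions: it uses the built-in mtcars data with an arbitrary train/test split, and the helpers rmse() and r2() are illustrative names, not functions from the randomForest package.

```r
library(randomForest)

set.seed(42)
# Hypothetical train/test split of the built-in mtcars data
idx   <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

rf <- randomForest(mpg ~ ., data = train, ntree = 500)

# Illustrative helpers (not part of randomForest)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# predict() without newdata returns out-of-bag predictions for the training rows
oob.pred  <- predict(rf)
test.pred <- predict(rf, newdata = test)

train.rmse <- rmse(train$mpg, oob.pred)   # out-of-bag RMSE on the training set
train.r2   <- r2(train$mpg, oob.pred)     # out-of-bag pseudo-R2 on the training set
test.rmse  <- rmse(test$mpg, test.pred)   # RMSE on the held-out rows
test.r2    <- r2(test$mpg, test.pred)     # R2 on the held-out rows
```

Here train.rmse should agree closely with sqrt(tail(rf$mse, 1)), since both are based on the out-of-bag predictions of the full forest.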

BazookaDave
  • Thank you, but I have two more questions. First, can I use rf.results$mse to get the mse and rsq of the training set? Second, why do I get a vector as the result? I need just one number each for mse and rsq, but it seems to give me one mse and one rsq for each sample of the data. What should I do? – Farhaneh Moradi Apr 22 '15 at 07:11
  • `rf.results$mse` will give you the mse for the training set and `rf.results$rsq` will give the pseudo-R2 for the training set (both are computed from out-of-bag predictions). The mse and rsq from `rf.results$test` are performance measures on the validation set. You should use these to find the optimal number of trees to have in the forest. The reason you get a vector of results is the `ntree` argument, not the number of samples: you get one performance measure for each forest consisting of 1 to `ntree` trees, so the last element is the value for the full forest. – BazookaDave Apr 22 '15 at 15:29
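
As a rough illustration of the last comment, the sketch below reduces the per-tree-count vectors to single numbers. It assumes the built-in mtcars data and an arbitrary split; all object names are hypothetical.

```r
library(randomForest)

set.seed(1)
idx <- sample(nrow(mtcars), 22)   # hypothetical split; column 1 of mtcars is mpg
rf  <- randomForest(x = mtcars[idx, -1],      y = mtcars$mpg[idx],
                    xtest = mtcars[-idx, -1], ytest = mtcars$mpg[-idx],
                    ntree = 300)

n.values <- length(rf$test$mse)   # one entry per forest size, so ntree entries

# The last element corresponds to the full forest of ntree trees
final.test.rmse <- sqrt(tail(rf$test$mse, 1))
final.test.rsq  <- tail(rf$test$rsq, 1)
final.oob.rmse  <- sqrt(tail(rf$mse, 1))   # training set, out-of-bag
final.oob.rsq   <- tail(rf$rsq, 1)

# Forest size with the lowest validation-set MSE
best.ntree <- which.min(rf$test$mse)
```

Plotting rf$test$mse against the number of trees is another way to see where the error levels off.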