How to know if a regression model generated by random forests is good? (MSE and %Var(y))

Question

I tried to use random forests for regression. The original data is a data frame of 218 rows and 9 columns. The first 8 columns are categorical (each value is A, B, C, or D), and the last column, V9, is numeric, with values ranging from 10.2 to 999.87.

When I used random forests on a training set, which represents 2/3 of the original data and which is randomly selected, I got the following results.

> r = randomForest(V9 ~ ., data = trainingData, mtry = 4, ntree = 1000, importance = TRUE, do.trace = 100)
       |      Out-of-bag   |
  Tree |      MSE  %Var(y) |
   100 | 6.927e+04    98.98 |
   200 | 6.874e+04    98.22 |
   300 | 6.822e+04    97.48 |
   400 | 6.812e+04    97.34 |
   500 | 6.839e+04    97.73 |
   600 | 6.852e+04    97.92 |
   700 | 6.826e+04    97.54 |
   800 | 6.815e+04    97.39 |
   900 | 6.803e+04    97.21 |
  1000 | 6.796e+04    97.11 |

I do not know if the high variance percentage means that the model is good or not. Also, since MSE is high, I suspect that the regression model is not really good. Any idea about how to read the results above? Do they mean that the model is not good?

The fact that the %Var explained is so high, and changes so little (in the wrong direction) would certainly make me suspicious. Model assessment is as much an art as a science. How does the model perform on the held out test data? Try looking at a plot of fitted vs. actual data. — joran, May 14 '13 at 17:10

Gorgens · Accepted Answer · 2013-05-14T17:40:08.080

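As @joran suggested, the %Var(y) column deserves a closer look. In the do.trace output it is the out-of-bag MSE expressed as a percentage of Var(y), so a value near 100 means the forest explains very little of the variance of Y, not most of it. After fitting, apply the model to your validation data (the remaining 1/3):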

RFestimated = predict(r, newdata = ValidationData)
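
Before turning to the validation predictions, the out-of-bag trace from the question can be sanity-checked with plain arithmetic. This is a minimal sketch, assuming %Var(y) is 100 * MSE / Var(y); that reading is consistent with V9 ranging from 10.2 to 999.87, since it implies a standard deviation of a few hundred. The variable names are illustrative only.

```r
# Values copied from the last row of the do.trace output in the question.
oob_mse    <- 6.796e4   # out-of-bag MSE after 1000 trees
pct_of_var <- 97.11     # the "%Var(y)" column

# Assumption: %Var(y) = 100 * MSE / Var(y), so Var(y) can be back-computed.
implied_var <- oob_mse / (pct_of_var / 100)
implied_sd  <- sqrt(implied_var)

# Typical size of an out-of-bag error, in the units of V9.
oob_rmse <- sqrt(oob_mse)

# Share of the variance explained under that reading.
pseudo_r2 <- 1 - pct_of_var / 100

cat("implied sd of V9:", round(implied_sd, 1), "\n")
cat("OOB RMSE:        ", round(oob_rmse, 1), "\n")
cat("pseudo R-squared:", round(pseudo_r2, 4), "\n")
```

Under that assumption the out-of-bag RMSE (roughly 260) is almost as large as the spread of V9 itself (roughly 265), which is why the MSE looks high.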

It is also worth checking the residuals:

qqnorm((RFestimated - ValidationData$V9)/sd(RFestimated-ValidationData$V9))

qqline((RFestimated-ValidationData$V9)/sd(RFestimated-ValidationData$V9))

Plot the estimated versus observed values (points close to the 1:1 line indicate a good fit):

plot(ValidationData$V9, RFestimated)
abline(0, 1)

and the RMSE:

RMSE <- sqrt(mean((RFestimated - ValidationData$V9)^2))
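
An RMSE is easier to judge next to a naive baseline, such as always predicting the training-set mean. The sketch below uses base R only; `observed`, `predicted`, and `train_mean` are hypothetical stand-ins for `ValidationData$V9`, `RFestimated`, and `mean(trainingData$V9)`, filled with simulated numbers rather than the data from the question.

```r
# Hypothetical stand-ins; replace them with the real validation vectors.
set.seed(1)
observed   <- runif(73, min = 10.2, max = 999.87)   # plays the role of ValidationData$V9
predicted  <- mean(observed) + rnorm(73, sd = 50)   # plays the role of RFestimated
train_mean <- 500                                   # plays the role of mean(trainingData$V9)

# Root mean squared error between observed and predicted values.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

model_rmse    <- rmse(observed, predicted)
baseline_rmse <- rmse(observed, rep(train_mean, length(observed)))

# Skill relative to the baseline: 1 is a perfect fit, values near 0 or
# below mean the model is no better than predicting the mean.
skill <- 1 - (model_rmse / baseline_rmse)^2

cat("model RMSE:   ", round(model_rmse, 1), "\n")
cat("baseline RMSE:", round(baseline_rmse, 1), "\n")
cat("skill:        ", round(skill, 3), "\n")
```

If the model's RMSE is not clearly below the baseline RMSE, the forest adds little beyond the mean of V9.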

I hope this helps!
