
I am sorry for posting this question again, but I really need help with this now. I am trying to calculate the training-set AUC of a randomForest model in R. There are two ways to calculate it, but they give different results. The following is a reproducible example of my question. I would really appreciate it if someone could help!

library(randomForest)
library(pROC)
library(ROCR)
# prep training to binary outcome
train <- iris[iris$Species %in% c('virginica', 'versicolor'),]
train$Species <- droplevels(train$Species)

# build model
rfmodel <- randomForest(Species~., data=train, importance=TRUE, ntree=2)

#the first way to calculate training auc
rf_p_train <- predict(rfmodel, type="prob",newdata = train)[,2]
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train1 <- performance(rf_pr_train, measure = "auc")@y.values[[1]] 
r_auc_train1    #0.9888


#the second way to calculate training auc
rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train2 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train2  #0.9175
annadai

1 Answer


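To get the same result from both approaches, drop the newdata argument from the first one. When newdata is omitted, predict returns the out-of-bag (OOB) predictions stored in the model instead of re-scoring the training rows with every tree (this is explained in the package documentation for the predict function),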

rf_p_train <- predict(rfmodel, type="prob")[,2]
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train1 <- performance(rf_pr_train, measure = "auc")@y.values[[1]] 
r_auc_train1

returns,

[1] 0.8655172

The second approach uses rfmodel$votes, which holds the out-of-bag (OOB) votes, as explained in the package documentation of the randomForest function,

rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$Species)
r_auc_train2 <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
r_auc_train2

returns the same result (the exact value changes from run to run because no seed is set),

[1] 0.8655172
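
As a rough check of the explanation above, here is a minimal sketch that puts the three sets of probabilities side by side. It is an illustration under assumptions, not part of the original code: the seed value 42 is arbitrary, ntree is raised to 500 (instead of 2) so that every row is out-of-bag for at least one tree, and auc() is a hypothetical helper that only wraps the ROCR calls already used in the question.

```r
library(randomForest)
library(ROCR)

# Same two-class subset of iris as in the question
train <- iris[iris$Species %in% c("virginica", "versicolor"), ]
train$Species <- droplevels(train$Species)

# Assumption: arbitrary seed, used only to make the run repeatable
set.seed(42)
# Assumption: ntree = 500 rather than 2, so no row is left without OOB votes
rfmodel <- randomForest(Species ~ ., data = train, ntree = 500)

# Hypothetical helper wrapping the ROCR calls from the question
auc <- function(p, y) {
  performance(prediction(p, y), measure = "auc")@y.values[[1]]
}

p_oob   <- predict(rfmodel, type = "prob")[, 2]                   # no newdata: OOB predictions
p_votes <- rfmodel$votes[, 2]                                     # OOB vote fractions
p_resub <- predict(rfmodel, newdata = train, type = "prob")[, 2]  # training rows re-scored by all trees

# The two OOB routes should agree; the re-scored one is a different quantity
print(all.equal(as.vector(p_oob), as.vector(p_votes)))
print(c(oob   = auc(p_oob, train$Species),
        votes = auc(p_votes, train$Species),
        resub = auc(p_resub, train$Species)))
```

The re-scored AUC is usually the more optimistic of the three, since each row is scored by trees that saw it during training, so the OOB value is generally the more honest estimate of performance.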
lampros
  • Thank you very much! I should have looked at the documentation. – annadai Oct 17 '17 at 14:00
  • But is it possible for the AUC on the test data to be higher than on the training data in randomForest? I have worked with other data, and the test-set AUC is always higher than the training-set AUC. Could you help me with this? Thanks a lot! – annadai Oct 17 '17 at 14:03
  • @annadai, I think that the proper way would be to post a new question with your (sample) data. – lampros Oct 18 '17 at 12:34
  • The question is here. Thanks a lot for the help! https://stackoverflow.com/q/46812212/8737443 – annadai Oct 18 '17 at 14:10