I am a newbie in R and I am trying to do my best to create my first model. I am working on a two-class random forest project, and so far I have programmed the model as follows:

library(randomForest)

set.seed(2015)

randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)

varImpPlot(randomforest)

prediction <- predict(randomforest, test,type='prob')

print(prediction)

I am not sure why I don't get the overall prediction for my model. I must be missing something in my code. I get the OOB error and the per-case predictions on the test set, but not the overall prediction of the model.

library(pROC)

auc <-roc(test$goodkit,prediction)

print(auc)

This doesn't work at all.

I have been through the pROC manual, but I can't make sense of all of it. It would be very helpful if anyone could help with the code or post a link to a good practical example.

    What exactly is the "overall prediction" for the model? Requests for links to tutorials are considered off-topic for this site. It's better to ask a clear programming question. – MrFlick Jul 28 '15 at 14:44
  • By overall prediction I mean a prediction score for my model. Any help/tip with the code for the AUC? – WillieM Jul 28 '15 at 15:30

2 Answers

Using the ROCR package, the following code should work for calculating the AUC:

library(ROCR)
predictedROC <- prediction(prediction[,2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
  • I have tried your code and I get a few errors. Also, I don't understand it. I think everything is ok till predictedROC <- prediction(prediction, as.factor(test$goodkit)) – WillieM Jul 28 '15 at 17:05
  • I think we did not need to specify "as.numeric" because I used type="prob" on the prediction. Also, I don't understand the "performance(predictedROC, "auc")" part: why would you include predictedROC in the statement when that is what you are trying to calculate? The last part of the code, "@y.values", seems to be used to do some type of cross-validation? The definition of "y.values" says: "A list in which each entry contains the y values of the curve of this particular cross-validation run." – WillieM Jul 28 '15 at 17:27
  • Sorry, I forgot a small but crucial part of that code! However, you seemed confused beyond that. A key advanced part of R is being able to build your own special data structures for specific problems. Many packages, including ROCR, take advantage of this. S4 objects are these special types of objects. So, we first create a prediction object using the ROCR prediction function, and then use that prediction object when creating a performance object. This is why I used as.numeric() in the code. – Morris Greenberg Jul 28 '15 at 17:41
  • Morris, thanks a lot for your post. I have modified the code and it is working now. – WillieM Aug 02 '15 at 09:31
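Putting the pieces together, here is a minimal, self-contained sketch of the ROCR approach on synthetic data (the question's training1 and test objects are not shown, so the data frames below are hypothetical stand-ins):

```r
library(randomForest)
library(ROCR)

set.seed(2015)

# Hypothetical stand-in for the original training1/test data
df <- data.frame(goodkit = factor(sample(0:1, 200, replace = TRUE)),
                 x1 = rnorm(200),
                 x2 = rnorm(200))
train <- df[1:150, ]
test  <- df[151:200, ]

rf <- randomForest(goodkit ~ ., data = train, ntree = 500)

# type = 'prob' returns a matrix with one column per class
probs <- predict(rf, test, type = "prob")

# Wrap the scores and true labels in a ROCR prediction object,
# then query the AUC from a performance object (an S4 object,
# hence the @y.values slot access)
predictedROC <- prediction(probs[, 2], test$goodkit)
auc <- as.numeric(performance(predictedROC, "auc")@y.values)
print(auc)
```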

Your problem is that predict on a randomForest object with type='prob' returns two sets of predictions: for binary classification, each column contains the probability of belonging to one of the two classes.

You have to decide which of these columns to use to build the ROC curve. Fortunately, for binary classification they are equivalent (just reversed):

auc1 <- roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <- roc(test$goodkit, prediction[,2])
print(auc2)
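To see why either column works: for a binary outcome the two columns sum to 1, so one is just the reverse of the other, and pROC's default direction = "auto" orients the curve for you. A quick sketch with hypothetical probabilities and labels:

```r
library(pROC)

# Hypothetical two-column probability matrix (as returned by
# predict(..., type = "prob")) and the matching true labels
labels <- factor(c(0, 0, 1, 1, 1))
probs  <- cbind(`0` = c(0.9, 0.6, 0.3, 0.2, 0.1),
                `1` = c(0.1, 0.4, 0.7, 0.8, 0.9))

auc1 <- roc(labels, probs[, 1])  # scores = P(class 0)
auc2 <- roc(labels, probs[, 2])  # scores = P(class 1)

# Both report the same area under the curve, because the
# columns carry the same ranking information, just reversed
print(auc(auc1))
print(auc(auc2))
```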
Calimo
  • Thanks a lot for your post. What makes sense in my case is using auc2, as suggested by Morris. – WillieM Aug 02 '15 at 09:45
  • @WillieM please remember to accept an answer if it answers your question, and upvote answers that you found useful. – Calimo Aug 03 '15 at 08:39