I'm having an issue when building random forest models with caret. I have a dataset of about 46k rows and 10 columns (one of which is the target variable). On this dataset, I'm trying to compare different classifiers. I did the following:

library(caret)

#Resampling setup: bootstrap, class probabilities, ROC-based summary
ctrl = trainControl(method="boot"
  ,classProbs=TRUE
  ,summaryFunction=twoClassSummary )

#GLM Model:
model.glm = train(x=d[,2:10]
  ,y=d$CONV_BT, method='glm'
  ,trControl=ctrl, metric="ROC"
  ,family="binomial")

#Random Forest Model:
model.rf = train(x=d[,2:10]
  ,y=d$CONV_BT, method='rf'
  ,trControl=ctrl, metric="ROC")

#Naive Bayes Model:
model.nb = train(x=d[,2:10]
  ,y=d$CONV_BT, method='nb'
  ,trControl=ctrl, metric="ROC" )

Both model.glm and model.nb look pretty decent: when I look at the 25 bootstrap replications, each one has an ROC (area under the ROC curve) of around 0.7. However, something appears to be wrong with model.rf, because its reported ROC values are all around 0.3. That suggests to me that something is being specified incorrectly, because I could just switch the rf model's predictions from p to 1-p and the ROC would then be 0.7, right?
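To illustrate the p versus 1-p point, here is a minimal sketch in base R. It assumes simulated stand-in data (not my real dataset) and a hand-written rank-based AUC helper named `auc`, which is hypothetical and not part of caret. It only shows that flipping the scores turns an AUC of a into 1 - a; it does not reproduce the problem with model.rf.

```r
# Rank-based (Mann-Whitney) AUC. `auc` is a hypothetical helper for this
# sketch, not a caret function; `positive` names the class treated as the event.
auc <- function(score, label, positive) {
  is_pos <- label == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(score)  # average ranks, so ties are handled
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)

# Simulated stand-in data (an assumption for illustration only)
label <- factor(sample(c("no", "yes"), 1000, replace = TRUE))
p_yes <- plogis(rnorm(1000, mean = ifelse(label == "yes", 0.7, 0)))

auc_p <- auc(p_yes, label, positive = "yes")            # scores as given
auc_flipped <- auc(1 - p_yes, label, positive = "yes")  # scores flipped to 1-p

print(c(original = auc_p, flipped = auc_flipped, sum = auc_p + auc_flipped))
```

The two values should sum to 1 (up to floating-point ties), which is why an ROC of about 0.3 looks to me like a good model scored against the wrong class rather than a genuinely bad model.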

I'm sorry that I can't provide the data (it's too big to upload, and it's proprietary). The other bizarre thing is that when I simulate data, I no longer have this issue. Any idea what this could be? Thanks for your help!
