
I've noticed that when running penalized logistic regression in caret with the glmnet package, the cross-validated predictions saved in model$pred are returned as 0/1 class outcomes:

library(caret)

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)
glmnetGrid <- expand.grid(alpha=c(0, .5, 1), lambda=c(.1, 1, 10))
model <- train(as.factor(admit) ~ ., data=mydata, trControl=train_control,
               method="glmnet", family="binomial", tuneGrid=glmnetGrid,
               metric="Accuracy", preProcess=c("center","scale"))
model

glmnet 

400 samples
  3 predictor
  2 classes: '0', '1' 

Pre-processing: centered (3), scaled (3) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 360, 360, 361, 359, 360, 361, ... 
Resampling results across tuning parameters:

  alpha  lambda  Accuracy      Kappa          Accuracy SD     Kappa SD     
  0.0     0.1    0.6923233271  0.09027099758  0.018975211636  0.06988057154
  0.0     1.0    0.6825703565  0.00000000000  0.007557700521  0.00000000000
  0.0    10.0    0.6825703565  0.00000000000  0.007557700521  0.00000000000
  0.5     0.1    0.6825703565  0.00000000000  0.007557700521  0.00000000000
  0.5     1.0    0.6825703565  0.00000000000  0.007557700521  0.00000000000
  0.5    10.0    0.6825703565  0.00000000000  0.007557700521  0.00000000000
  1.0     0.1    0.6825703565  0.00000000000  0.007557700521  0.00000000000
  1.0     1.0    0.6825703565  0.00000000000  0.007557700521  0.00000000000
  1.0    10.0    0.6825703565  0.00000000000  0.007557700521  0.00000000000

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 0.1. 
> head(model$pred)
  pred obs rowIndex alpha lambda Resample
1    0   0       16     0     10   Fold01
2    0   0       17     0     10   Fold01
3    0   0       24     0     10   Fold01
4    0   1       46     0     10   Fold01
5    0   0       69     0     10   Fold01
6    0   0       84     0     10   Fold01

> summary(model$pred)
 pred     obs         rowIndex          alpha         lambda       Resample        
 0:3576   0:2457   Min.   :  1.00   Min.   :0.0   Min.   : 0.1   Length:3600       
 1:  24   1:1143   1st Qu.:100.75   1st Qu.:0.0   1st Qu.: 0.1   Class :character  
                   Median :200.50   Median :0.5   Median : 1.0   Mode  :character  
                   Mean   :200.50   Mean   :0.5   Mean   : 3.7                     
                   3rd Qu.:300.25   3rd Qu.:1.0   3rd Qu.:10.0                     
                   Max.   :400.00   Max.   :1.0   Max.   :10.0    

Is it possible to obtain the raw predicted probabilities, i.e. the inverse logit p = exp(eta) / (1 + exp(eta)) of the linear predictor eta, rather than the 0/1 predicted outcomes?
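
For reference, a minimal sketch of the inverse-logit transform (the eta value below is hypothetical):

eta <- 0.5                    # hypothetical linear predictor (log-odds)
exp(eta) / (1 + exp(eta))     # inverse logit; equivalent to plogis(eta)
[1] 0.6224593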

  • FYI, it is very much recommended that you NOT pass in single values of lambda and instead use the nfolds argument to choose the best cross-validated lambda. If you insist on choosing lambdas, the docs suggest you pass in a lambda sequence, as performance can be much slower for single values (see the sketch after these comments). – Zelazny7 Apr 13 '16 at 17:03
  • @Zelazny7 - Thank you for the tip! – RobertF Apr 13 '16 at 18:12
  • @Zelazny7 - The author of caret indicated caret *does* cross validate over both alpha and lambda in this post: http://stats.stackexchange.com/questions/69638/does-caret-train-function-for-glmnet-cross-validate-for-both-alpha-and-lambda?rq=1 where the OP used the same expand.grid() syntax I did. Perhaps it's not necessary to use the nfolds or foldid arguments from glmnet when running in caret? – RobertF Apr 13 '16 at 19:28
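
A hedged sketch of the lambda-sequence approach from Zelazny7's comment; the grid values below are illustrative, not tuned:

glmnetGrid <- expand.grid(alpha  = c(0, .5, 1),
                          lambda = 10^seq(-3, 1, length.out = 20))  # log-spaced lambda sequence

# Alternatively, glmnet's own cross-validation can choose lambda (x and y here
# stand for the model matrix and response, built outside caret):
# cvfit <- glmnet::cv.glmnet(x, y, family = "binomial", nfolds = 10)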

1 Answer


You have to set classProbs = TRUE in trainControl. The levels of the factor admit also need to be valid R variable names (e.g. "yes"/"no" rather than "0"/"1"), because they are used as column names for the class probabilities. See the following example.

library(caret)

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$admit <- as.factor(mydata$admit)

# relabel the levels to valid R names so the class-probability columns get correct names
# (levels are relabeled in their existing order, so "yes" here corresponds to admit = 0)
levels(mydata$admit) = c("yes", "no")

train_control <- trainControl(method="cv", number=10, classProbs = TRUE, savePredictions = TRUE)
glmnetGrid <- expand.grid(alpha=c(0, .5, 1), lambda=c(.1, 1, 10))
set.seed(4242)
model <- train(admit ~ ., 
              data=mydata, 
              trControl = train_control, 
              method="glmnet", 
              family="binomial", 
              tuneGrid=glmnetGrid, 
              metric="Accuracy", 
              preProcess=c("center","scale"))

head(model$pred)
  pred obs rowIndex       yes        no alpha lambda Resample
1  yes  no        4 0.6856383 0.3143617     0     10   Fold01
2  yes  no        6 0.6796251 0.3203749     0     10   Fold01
3  yes yes       10 0.6764742 0.3235258     0     10   Fold01
4  yes yes       71 0.6795685 0.3204315     0     10   Fold01
5  yes  no       78 0.6774003 0.3225997     0     10   Fold01
6  yes yes       82 0.6812158 0.3187842     0     10   Fold01
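
For probabilities on new data (outside the saved resampling predictions), predict.train accepts type = "prob". A minimal sketch, reusing the training frame as stand-in new data:

probs <- predict(model, newdata = mydata, type = "prob")
head(probs)   # one column per class level: "yes" and "no"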
  • Thanks, so then I use the "yes" column (model$pred$yes) as the predicted probability. – RobertF Apr 13 '16 at 19:07
  • I would use `ifelse(model$pred$pred == "yes", model$pred$yes, model$pred$no)`. If the prediction is no, then the no column contains the predicted probability. – phiver Apr 13 '16 at 19:41
  • It looks like caret is arbitrarily assigning pred="yes" if model$pred$yes>=.5 and pred="no" if model$pred$yes<.5. Look at the min and max values in the "yes" column when running summary(model$pred[ which(model$pred$pred=="no"), ]) and summary(model$pred[ which(model$pred$pred=="yes"), ]). – RobertF Apr 13 '16 at 20:58
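
A quick sketch to check the 0.5 cutoff described in the last comment, recomputing the class labels from the "yes" probabilities:

recut <- factor(ifelse(model$pred$yes >= 0.5, "yes", "no"),
                levels = levels(model$pred$pred))
all(recut == model$pred$pred)   # TRUE if a 0.5 cutoff reproduces caret's pred column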