I have some data similar to:

data(Titanic) # need one row per passenger

df <- data.frame(Titanic, stringsAsFactors=TRUE) 

df <- df[rep(seq_len(nrow(df)), df[,"Freq"]), which(names(df)!="Freq")] 

I trained a model in caret using repeated cross-validated logistic regression, like:

library(caret) 

tc <- trainControl(method="repeatedcv", number=10, repeats=3, 
                   returnData=TRUE, savePredictions=TRUE, classProbs=TRUE)

# Freq was dropped when the table was expanded to one row per passenger,
# so no case weights are passed here
glmFit <- train(Survived ~ Class + Sex + Age, data = df,
                method="glm", family="binomial",
                trControl = tc)

summary(glmFit)

I would like to obtain the average in-sample fitted probability and out-of-sample predicted probability (averages of 27 and of 3 values for each row in the data frame, respectively, in this case since it's 10-fold CV x 3 repeats).

I would like to append each row's average in-sample and out-of-sample probability estimates onto the data frame -- to look like the last two columns of:

> df_appended
Class  Sex  Age    Survived  training_p_surv_est  testing_p_surv_est
3rd    M    Child  0         .251                 .259
3rd    M    Child  1         .251                 .259
2nd    M    Child  1         .324                 .319
2nd    M    Child  0         .324                 .319

According to ?trainControl, I have saved the holdout predictions for each resample with savePredictions=TRUE. (And classProbs=TRUE, since I want raw probabilities, not classes.)

How do I access the in-sample and out-of-sample predictions? Looking at ?predict.train, I have tried using

extractProb(list(glmFit)) 
#Error in eval(expr, envir, enclos) : object 'Class2nd' not found 

Many thanks.

milos.ai
C8H10N4O2

1 Answer

If you take a look at your glmFit object, you will see that it contains an element named 'pred', which is a data frame of the saved holdout predictions.

head(glmFit$pred)

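You will get the predicted class probabilities as well as the predicted class for every held-out row, in each repeat and fold of the cross-validation. The rowIndex column identifies the row of the training data each prediction belongs to.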
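To turn those saved predictions into one out-of-sample average per passenger, a minimal sketch could look like the following. It assumes the question's data and model setup (without the weights argument, since the Freq column no longer exists after expansion), that the outcome levels are "No" and "Yes" so the probability column in glmFit$pred is named Yes, and an arbitrary seed. The column name testing_p_surv_est is taken from the question.

```r
library(caret)

# Rebuild the question's data: one row per passenger
data(Titanic)
df <- data.frame(Titanic, stringsAsFactors = TRUE)
df <- df[rep(seq_len(nrow(df)), df[, "Freq"]), names(df) != "Freq"]

set.seed(1)  # arbitrary seed, only for reproducibility
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                   savePredictions = TRUE, classProbs = TRUE)
glmFit <- train(Survived ~ Class + Sex + Age, data = df,
                method = "glm", family = "binomial", trControl = tc)

# Each row is held out once per repeat, so it has 3 saved predictions.
# Average P(Survived == "Yes") over them, keyed by rowIndex.
oos <- aggregate(Yes ~ rowIndex, data = glmFit$pred, FUN = mean)

# Align the averages with the rows of df and append them
df$testing_p_surv_est <- oos$Yes[match(seq_len(nrow(df)), oos$rowIndex)]
head(df)
```

Matching on rowIndex rather than relying on the order of glmFit$pred keeps the averages aligned with df even though the saved predictions are grouped by resample.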

cheers.

yuanhangliu1
    I am not sure this is what the question asked about. `glmFit$pred` only gives the out-of-sample predictions, since in every fold the held-out sample is not used for training, only for prediction. – exAres Jun 21 '15 at 16:28
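
For the in-sample half of the question, caret does not appear to store per-resample fitted values, so one possible approach is to refit the model on each resample's training rows and average the fitted probabilities. The sketch below is an assumption-laden illustration, not caret's own method: it relies on glmFit$control$index holding the training row indices for each resample, refits with plain glm(), and uses an arbitrary seed. The column name training_p_surv_est comes from the question.

```r
library(caret)

# Rebuild the question's data and model (no weights; Freq was dropped)
data(Titanic)
df <- data.frame(Titanic, stringsAsFactors = TRUE)
df <- df[rep(seq_len(nrow(df)), df[, "Freq"]), names(df) != "Freq"]

set.seed(1)  # arbitrary seed, only for reproducibility
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                   savePredictions = TRUE, classProbs = TRUE)
glmFit <- train(Survived ~ Class + Sex + Age, data = df,
                method = "glm", family = "binomial", trControl = tc)

# Training row indices, one integer vector per resample (30 in total)
idx <- glmFit$control$index

sums   <- numeric(nrow(df))   # running sum of fitted probabilities per row
counts <- integer(nrow(df))   # how many resamples trained on each row

for (rows in idx) {
  # Refit on this resample's training rows; glm() models P(second level),
  # which is P(Survived == "Yes") here
  fit <- glm(Survived ~ Class + Sex + Age, data = df[rows, ],
             family = binomial)
  p <- predict(fit, newdata = df[rows, ], type = "response")
  sums[rows]   <- sums[rows] + p
  counts[rows] <- counts[rows] + 1L
}

# Each row is in the training set 9 times per repeat, so 27 times overall
df$training_p_surv_est <- sums / counts
head(df)
```

Refitting 30 small logistic regressions is cheap for this data; for slower models it may be preferable to collect the fitted values during training instead.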