
Say I am doing classification like below:

library(mlbench)
data(Sonar)

library(caret)
set.seed(998)

my_data <- Sonar

fitControl <-
  trainControl(
    method = "cv",
    number = 10,
    classProbs = T,
    savePredictions = T,
    summaryFunction = twoClassSummary
  )


model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)

For each of the 10 folds, 10% of the data is used for validation. For the best parameters that caret selects, I can use `getTrainPerf(model)` to get the mean validation AUC over all 10 folds, or `model$resample` to get the individual AUC values for each fold. However, I cannot find a way to get the AUC on the remaining 90% of the data that each fold's model was trained on. It would be nice if I could get the individual AUC values for each training set as well.
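For reference, these are the two calls mentioned above:

getTrainPerf(model)   # mean ROC (AUC), Sens and Spec over the 10 validation folds
model$resample        # the individual ROC/Sens/Spec values per fold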

How can this information be extracted? I want to make sure my model is not overfit (the data set I am working with is very small).

Thanks!

Keshav M
  • This works, but only reports the AUC for the validation set (the 10% the model is tested on). However, I am interested in the AUC on the other 90% of the data that the model was trained on in each fold. Any idea how to do that? – Keshav M Jan 06 '18 at 19:48
  • FYI, to subset I did: `for (a in 1:length(model$bestTune)) { model$pred <- model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ] }` – Keshav M Jan 06 '18 at 19:52
  • The value from the training set cannot be used for the purpose you intend. – IRTFM Jan 06 '18 at 22:54
  • @42- Can you elaborate on that? How would you recommend testing whether my model is overfit if I do not have a separate data set to validate on? Doing 10-fold cross validation is the only "test" set I am working with. – Keshav M Jan 06 '18 at 23:07
  • I agree with @42, with a model such as xgboost it is likely the AUC on the training folds will be 1. This will tell you little about over-fitting: obviously the model will be over-fit, and the values on the test folds tell you the amount of over-fitting. On topic, I am unsure whether the train-set error can be extracted from the caret train object, but it can easily be estimated if you take the exact same folds and write a custom function that does 10-fold CV and returns the train-set metric you want. If you are interested in this I can post an answer showing how it can be achieved. – missuse Jan 06 '18 at 23:40
  • With a small data set I would still try to split 80 : 20 and have an independent test set to verify if the CV error is close to test error. – missuse Jan 06 '18 at 23:42
  • @missuse I would really appreciate it if you could post this answer. If caret could still be used for training, that would be awesome as I am testing other algorithms too and caret makes it easy to switch algorithms and optimize parameters. Thanks! – Keshav M Jan 06 '18 at 23:46
  • @Keshav M while I really like caret I think that xgboost has some really interesting parameters such as `early_stopping_rounds`, `scale_pos_weight`, `alpha`, `lambda` and `colsample_bylevel` that are not available in caret train. Also `xgb.cv` is quite a solid function for hyperparameter tuning. – missuse Jan 07 '18 at 00:37
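As a side note to the last comment, here is a minimal `xgb.cv` sketch; the parameter values below are illustrative placeholders, not tuned:

library(xgboost)

# minimal xgb.cv sketch; parameter values are illustrative, not tuned
dtrain <- xgb.DMatrix(data.matrix(Sonar[, names(Sonar) != "Class"]),
                      label = as.numeric(Sonar$Class) - 1)

cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",
                           eta = 0.1,
                           max_depth = 3,
                           alpha = 0,
                           lambda = 1),
             data = dtrain,
             nrounds = 500,
             nfold = 10,
             early_stopping_rounds = 20,
             verbose = 0)

cv$evaluation_log   # train and test AUC per boosting round
cv$best_iteration   # round chosen by early stopping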

1 Answer


As requested in the comments, here is a custom function to evaluate the AUC on the training portion of each cross-validation fold. I am not sure whether that value can be extracted from the caret train object directly.

After running the caret train, extract the fold assignments for the best tune:

library(tidyverse)
# keep only the predictions made with the best tune, then recover the fold
# (Resample) each row was held out in; "Fold01".."Fold10" is reduced to its
# last digit, with 0 mapped back to 10
model$bestTune %>%
  left_join(model$pred) %>%
  select(rowIndex, Resample) %>%
  mutate(Resample = as.numeric(gsub(".*(\\d$)", "\\1", Resample)),
         Resample = ifelse(Resample == 0, 10, Resample)) %>%
  arrange(rowIndex) -> resamples
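An optional sanity check, using the column names produced above, that every row ended up in exactly one fold:

# optional sanity check: one row index per row of Sonar, ~21 rows per fold
nrow(resamples) == nrow(my_data)
table(resamples$Resample)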

Construct a cross validation function that will use the same folds as caret did:

library(xgboost)
train <- my_data[, !names(my_data) %in% "Class"]
label <- as.numeric(my_data$Class) - 1

test_auc <- lapply(1:10, function(x){
  # refit on the nine training folds using the best tune found by caret
  # (note: eta, min_child_weight and subsample from bestTune are not passed
  # here, so xgboost's defaults are used for them)
  fold_model <- xgboost(data = data.matrix(train[resamples[,2] != x,]),
                        label = label[resamples[,2] != x],
                        nrounds = model$bestTune$nrounds,
                        max_depth = model$bestTune$max_depth,
                        gamma = model$bestTune$gamma,
                        colsample_bytree = model$bestTune$colsample_bytree,
                        objective = "binary:logistic",
                        eval_metric = "auc",
                        print_every_n = 50)
  # AUC on the training portion and on the held-out fold
  preds_train <- predict(fold_model, data.matrix(train[resamples[,2] != x,]))
  preds_test  <- predict(fold_model, data.matrix(train[resamples[,2] == x,]))
  auc_train <- pROC::auc(pROC::roc(response = label[resamples[,2] != x], predictor = preds_train, levels = c(0, 1)))
  auc_test  <- pROC::auc(pROC::roc(response = label[resamples[,2] == x], predictor = preds_test, levels = c(0, 1)))
  return(data.frame(fold = x, auc_train, auc_test))
  })

do.call(rbind, test_auc)
#output
   fold auc_train  auc_test
1     1         1 0.9909091
2     2         1 0.9797980
3     3         1 0.9090909
4     4         1 0.9629630
5     5         1 0.9363636
6     6         1 0.9363636
7     7         1 0.9181818
8     8         1 0.9636364
9     9         1 0.9818182
10   10         1 0.8888889
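To summarize the gap between the two columns, the per-fold values can simply be averaged:

res <- do.call(rbind, test_auc)
# mean AUC on the 90% the model was fit on vs. the 10% held-out fold
mean(res$auc_train)
mean(res$auc_test)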

For comparison, these are caret's own per-fold validation results:

arrange(model$resample, Resample)
#output
         ROC      Sens      Spec Resample
1  0.9909091 1.0000000 0.8000000   Fold01
2  0.9898990 0.9090909 0.8888889   Fold02
3  0.9909091 0.9090909 1.0000000   Fold03
4  0.9444444 0.8333333 0.8888889   Fold04
5  0.9545455 0.9090909 0.8000000   Fold05
6  0.9272727 1.0000000 0.7000000   Fold06
7  0.9181818 0.9090909 0.9000000   Fold07
8  0.9454545 0.9090909 0.8000000   Fold08
9  0.9909091 0.9090909 0.9000000   Fold09
10 0.8888889 0.9090909 0.7777778   Fold10

Why the test fold AUC from my function and from caret are not exactly the same I cannot say for certain. The folds are the same, but eta, min_child_weight and subsample from bestTune are not passed to xgboost above (the defaults are used for those), and the random seed differs as well. When I compute the AUC from caret's held-out predictions I do get the same output as caret:

model$bestTune %>%
  left_join(model$pred) %>%
  arrange(rowIndex) %>%
  select(M, Resample, obs) %>%   # M = predicted probability of class "M"
  mutate(Resample = as.numeric(gsub(".*(\\d$)", "\\1", Resample)),
         Resample = ifelse(Resample == 0, 10, Resample),
         obs = as.numeric(obs) - 1) %>%
  group_by(Resample) %>%
  do(auc = as.vector(pROC::auc(pROC::roc(response = .$obs, predictor = .$M)))) %>%
  unnest()
#output
   Resample   auc
      <dbl> <dbl>
 1     1.00 0.991
 2     2.00 0.990
 3     3.00 0.991
 4     4.00 0.944
 5     5.00 0.955
 6     6.00 0.927
 7     7.00 0.918
 8     8.00 0.945
 9     9.00 0.991
10    10.0  0.889

But again I emphasize that the training error will tell you little and that you should rely on the cross-validation (test fold) error. If you would like to bring the two closer, consider fiddling with the gamma, alpha and lambda parameters.
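As an illustration only (the values below are placeholders, not tuned), these regularization parameters can be passed directly to xgboost; to judge their effect you would rerun the fold loop above with them added:

# illustrative regularization settings; values are placeholders, not tuned
reg_model <- xgboost(data = data.matrix(train),
                     label = label,
                     nrounds = model$bestTune$nrounds,
                     max_depth = model$bestTune$max_depth,
                     colsample_bytree = model$bestTune$colsample_bytree,
                     gamma = 1,     # minimum loss reduction required to split
                     alpha = 0.5,   # L1 regularization on leaf weights
                     lambda = 2,    # L2 regularization on leaf weights
                     objective = "binary:logistic",
                     eval_metric = "auc",
                     verbose = 0)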

With a small data set I would still try to split train : test = 80 : 20 and use that independent test set to verify whether the CV error is close to the test error.
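A minimal sketch of such a split with caret's createDataPartition, assuming the Sonar data and the fitControl object from the question (the seed is arbitrary):

set.seed(825)
# stratified 80:20 split on the class labels
idx <- createDataPartition(my_data$Class, p = 0.8, list = FALSE)
train_set <- my_data[idx, ]
test_set  <- my_data[-idx, ]

# retrain on the 80% with the same resampling setup, then check AUC on the 20%
model80 <- train(Class ~ ., data = train_set, method = "xgbTree",
                 trControl = fitControl, metric = "ROC")
probs <- predict(model80, newdata = test_set, type = "prob")[, "M"]
pROC::auc(pROC::roc(response = test_set$Class, predictor = probs, levels = c("R", "M")))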

missuse
  • If the test error of the independent test set is similar to the CV error, then the model is not overfit, correct? – Keshav M Jan 07 '18 at 00:59
  • @Keshav M if the CV error correlates with the test error then the model is doing ok. Being close does not mean much, but when the CV error gets smaller, so should the test error. If not, there is a problem. – missuse Jan 07 '18 at 01:09
  • Sorry I am a bit confused. Where am I getting multiple CV error and test error values to look for a correlation? I was just considering the single values of test and CV error that would be generated by doing CV then independent test. Thank you so much for your help :) – Keshav M Jan 07 '18 at 01:42
  • @Keshav M You are testing a bunch of hyperparameters and possibly several learners. For all of these you can generate a CV error and a validation error. While you should not compare these for all tested cases, since it will most likely lead to over-fitting (especially on a small data set), you can take several and see if they show the same trend. Validation error is a highly biased metric (depending on the held-out samples it can return very different numbers) while CV has low bias and low variance. However, CV is often mis-specified (stratification, blocking, oversampling). – missuse Jan 07 '18 at 07:55