I would like to study the optimal bias/variance tradeoff for model tuning. I'm using caret for R, which lets me plot a performance metric (AUC, accuracy, ...) against the hyperparameters of the model (mtry, lambda, etc.) and automatically chooses the maximum. This typically returns a good model, but if I want to dig further and choose a different bias/variance tradeoff I need a learning curve, not a performance curve.
For the sake of simplicity, let's say my model is a random forest, which has just one hyperparameter, 'mtry'.
I would like to plot the learning curves of both the training and test sets. Something like this:

[learning-curve plot omitted; the red curve is the test set]
On the y axis I put an error metric (the number of misclassified examples, or something like that); on the x axis either 'mtry' or, alternatively, the training set size.
Questions:
Does caret have functionality to iteratively train models on training set folds of different sizes? If I have to code it by hand, how can I do that?
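Here is roughly what I have in mind if I have to do it by hand (a minimal sketch, assuming a data frame `my_data` with a factor outcome `Class`, both placeholders for my actual data): fix mtry, refit on stratified subsamples of increasing size, and record the training and test error of each fit.

    library(caret)

    set.seed(42)
    # Hypothetical data: my_data with a factor column Class
    idx      <- createDataPartition(my_data$Class, p = 0.8, list = FALSE)
    training <- my_data[idx, ]
    testing  <- my_data[-idx, ]

    fractions <- seq(0.2, 1.0, by = 0.2)   # fractions of the training set to use
    curve_df  <- data.frame()

    for (p in fractions) {
      # Stratified subsample of the training set (use everything when p == 1)
      sub <- if (p < 1) {
        training[createDataPartition(training$Class, p = p, list = FALSE), ]
      } else {
        training
      }

      # Fix mtry so the curve varies only with the training set size
      fit <- train(Class ~ ., data = sub,
                   method    = "rf",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid  = data.frame(mtry = 2))

      curve_df <- rbind(curve_df, data.frame(
        n_train   = nrow(sub),
        train_err = mean(predict(fit, sub)     != sub$Class),
        test_err  = mean(predict(fit, testing) != testing$Class)))
    }

    # Learning curves: training vs. test error against training set size
    plot(curve_df$n_train, curve_df$test_err, type = "b", col = "red",
         ylim = range(curve_df[, c("train_err", "test_err")]),
         xlab = "training set size", ylab = "misclassification error")
    lines(curve_df$n_train, curve_df$train_err, type = "b", col = "blue")
    legend("topright", c("test", "training"), col = c("red", "blue"), lty = 1)

Is there a built-in way to do this, or is a loop like the above the intended approach?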
If I want to put the hyperparameter on the x axis, I need all the models trained by caret::train, not just the final model (the one with the best performance chosen after CV). Are these "discarded" models still available after train?
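For example, I would like to be able to do something like the sketch below (reusing the hypothetical `training` data frame from above). As far as I can tell, the resampled performance for every candidate mtry is kept in the `results` element of the train object, and with `savePredictions` in `trainControl` the hold-out predictions are kept as well, but I'm not sure whether the fitted model objects themselves survive beyond `finalModel`.

    # A sketch, reusing the hypothetical `training` data frame from above
    ctrl <- trainControl(method = "cv", number = 5,
                         savePredictions = "all")   # keep hold-out predictions for every mtry

    fit <- train(Class ~ ., data = training,
                 method    = "rf",
                 trControl = ctrl,
                 tuneGrid  = data.frame(mtry = c(2, 4, 6, 8)))

    fit$results   # resampled performance (Accuracy, Kappa, ...) for every mtry, not just the best
    fit$pred      # hold-out predictions per resample and per mtry (because savePredictions = "all")

    # Cross-validated error against mtry
    plot(fit$results$mtry, 1 - fit$results$Accuracy, type = "b",
         xlab = "mtry", ylab = "CV error")

If the non-winning models really are discarded, is refitting for each mtry the only way to get their training-set error?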