
I have been delving into the R package caret recently, and have a question about reproducibility and comparison of models during training that I haven't quite been able to pin down.

My intention is that each train call, and thus each resulting model, uses the same cross-validation splits, so that the stored cross-validation results (the out-of-sample estimates calculated during model building) are comparable across models.

One method I've seen is to set the seed prior to each train call, like so:

set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))

However, does sharing a trainControl object between the train calls mean that they use the same resampling indexes, or do I have to explicitly pass the seeds argument into the function? Does the trainControl object generate random numbers when it is used, or are they fixed on declaration?

My current method has been:

set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)

Are these train calls going to use the same splits and be comparable, or do I have to take further steps to ensure that, i.e. specifying seeds when the trainControl object is made, or calling set.seed before each train? Or both?

Hopefully this has made some sense, and isn't a complete load of rubbish. Any help is appreciated.


My code project that I'm querying about can be found here. It might be easier to read it and you'll understand.

wcanners

1 Answer


The CV folds are not created when trainControl is defined, unless they are supplied explicitly via the index argument, which is what I recommend. The folds can be created with one of caret's specialized functions:

createFolds
createMultiFolds
createTimeSlices
groupKFold
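For instance, here is a minimal sketch of pre-defining the folds with createFolds and passing them to trainControl via index (the 10-fold setup on iris is just illustrative):

```r
library(caret)

set.seed(1)
# createFolds with returnTrain = TRUE returns, for each fold, the row
# indexes used for *training*; this is the format the index argument expects
cv_index <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

trControl <- trainControl(method = "cv",
                          index = cv_index,
                          savePredictions = "final")
# every train call that receives this trControl now uses identical splits
```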

That being said, merely setting a seed prior to the trainControl definition will not result in the same CV folds across train calls.

Example:

library(caret)
library(tidyverse)

set.seed(1)
trControl = trainControl(method = "cv",
                         returnResamp = "final",
                         savePredictions = "final")

Create two models:

knnFit1 <- train(iris[,1:4], iris[,5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

ldaFit2 <- train(iris[,1:4], iris[,5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)

Check whether the same indexes landed in the same folds:

knnFit1$pred %>%
  left_join(ldaFit2$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}
#FALSE

If you set the same seed prior to each train call:

set.seed(1)
knnFit1 <- train(iris[,1:4], iris[,5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

set.seed(1)
ldaFit2 <- train(iris[,1:4], iris[,5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)


set.seed(1)
rangerFit3 <- train(iris[,1:4], iris[,5],
                    method = "ranger",
                    tuneLength = 10,
                    trControl = trControl)


knnFit1$pred %>%
  left_join(ldaFit2$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}

knnFit1$pred %>%
  left_join(rangerFit3$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}

the same indexes will be used in the folds (both comparisons return TRUE). However, I would not rely on this method when using parallel computation. Therefore, to ensure the same data splits are used, it is best to define them manually via the index/indexOut arguments to trainControl.

When you set the index argument manually the folds will be the same; however, this does not ensure that models built with the same method will be identical, since most methods include some sort of stochastic process. So, to be fully reproducible, it is advisable to also set the seed prior to each train call. When running in parallel, to get fully reproducible models the seeds argument to trainControl needs to be set.
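For parallel work, here is a sketch of constructing the seeds list (assuming 10 CV folds and tuneLength = 10; the sizes are assumptions, and each vector must be at least as long as the number of tuning parameter combinations evaluated):

```r
library(caret)

set.seed(1)
# seeds must be a list of length (number of resamples + 1):
# one integer vector per resample, sized to cover every tuning
# parameter combination, plus a single integer for the final model
n_resamples <- 10   # method = "cv" with 10 folds
n_tune      <- 10   # matches tuneLength = 10
seeds <- vector("list", n_resamples + 1)
for (i in 1:n_resamples) seeds[[i]] <- sample.int(10000, n_tune)
seeds[[n_resamples + 1]] <- sample.int(10000, 1)

trControl <- trainControl(method = "cv",
                          seeds = seeds,
                          savePredictions = "final")
```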

missuse
    Thanks for your response @missuse. In your first example, you don't actually use the trainControl that you've declared, but if the CV folds aren't set when the object is declared then it won't matter. The `Index` argument sounds like the way forward. If you use the `index` argument, then is it necessary to set the seed prior to each `train`, or is that seen as good practice? – wcanners Oct 03 '18 at 09:22
  • @W. Canniford You are correct, I have fixed the code. In general when you set the `index` argument manually the folds will be the same however this does not ensure that models made by the same method will be the same since most methods include some sort of stochastic process. So to be fully reproducible it is advisable to set the seed prior to each train call also. When run in parallel to get fully reproducible models the `seeds` argument to `trainControl` needs to be set. – missuse Oct 03 '18 at 09:44
  • Thanks again @missuse! `index` sets the folds of the cross validation, `seeds` set the seeds that will be set during the `train` so that reproducible work can be done in parallel where using `set.seed` isn't possible and `set.seed` can be used alongside `index` when running processes non-parallel as some models have other random processes besides the cv folds/splits. I think I've got it. Thanks. – wcanners Oct 03 '18 at 10:09