How to get a reproducible result when using parallelization to do resampling with mlr3

Question

I have recently been learning to use the mlr3 package with parallelization. According to the mlr3 book (https://mlr3book.mlr-org.com/technical.html) and a tutorial (https://www.youtube.com/watch?v=T43hO2o_nZw&t=1s), mlr3 uses the future backends for parallelization. I ran a simple test with the following code:

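# load the packages
library(future)
library(future.apply)
library(mlr3)
library(mlr3learners)  # provides the "classif.ranger" learner

# set the task (train is my own data frame and is not shown here)
task_train <- TaskClassif$new(id = "survey_train", backend = train, target = "r_yn", positive = "yes")

# set the learner; classif.auc needs probability predictions
learner_ranger <- lrn("classif.ranger", predict_type = "prob")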

# set the cv
cv_5 <- rsmp("cv", folds = 5)

# run the resampling in parallel
plan(multisession, workers = 5)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
plan(sequential)
task_train_cv_5_par$aggregate(msr("classif.auc"))

The AUC changes every time, and I know this is because I did not set a random seed for the parallel run. I have found many tutorials about the future packages: one way to get a reproducible result with future is to use future_lapply from the future.apply package and set future.seed = TRUE. Another way is to use a future backend for a foreach loop together with %dorng% or registerDoRNG().

My question is how can I get a reproducible resampling result in mlr3 without using future_lapply or foreach? I guess there may be a simple way to get that. Thanks a lot!

score 4 · Accepted Answer · answered Feb 17 '21 at 09:18

I've changed your example to be reproducible to show that you just need to set a seed with set.seed():

library(future)  # needed for plan()
library(mlr3)
library(mlr3learners)

task_train <- tsk("sonar")
learner_ranger <- lrn("classif.ranger", predict_type = "prob")
cv_5 <- rsmp("cv", folds = 5)
plan(multisession, workers = 5)

# 1st resampling
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))

# 2nd resampling
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))

# 3rd resampling, now sequential
plan(sequential)
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))

You should get the same score for all three resamplings.
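If you would rather check this programmatically than compare the printed scores by eye, a sketch along the following lines should work. The helper cv_auc() is a hypothetical name introduced here for illustration and is not part of mlr3; two workers are used only to keep the example light.

```r
library(future)
library(mlr3)
library(mlr3learners)

# Hypothetical helper (not part of mlr3): run one seeded 5-fold CV
# on the sonar task and return the aggregated AUC as a plain number.
cv_auc <- function(seed) {
  set.seed(seed)
  rr <- resample(
    task = tsk("sonar"),
    learner = lrn("classif.ranger", predict_type = "prob"),
    resampling = rsmp("cv", folds = 5)
  )
  unname(rr$aggregate(msr("classif.auc")))
}

# Two parallel runs with the same seed
plan(multisession, workers = 2)
auc_par_1 <- cv_auc(1)
auc_par_2 <- cv_auc(1)

# One sequential run with the same seed
plan(sequential)
auc_seq <- cv_auc(1)

# Both comparisons are expected to be TRUE
isTRUE(all.equal(auc_par_1, auc_par_2))
isTRUE(all.equal(auc_par_1, auc_seq))
```

Wrapping the seed and the resample() call in one function keeps each run self-contained, so the only thing that differs between runs is the future plan.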

I have tried your method and it works fine. Thanks! – Kim.L Feb 18 '21 at 03:01

score 3 · Answer 2 · answered Feb 17 '21 at 06:56

You need to set a seed with an RNG kind that supports parallelization.

set.seed(42, "L'Ecuyer-CMRG")

See ?RNGkind for more information.

AFAIK, for deterministic parallel results in R, there is no other way than using this RNG kind. When running sequentially, you can just use the default RNG kind with set.seed(42).
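As a small base-R illustration of what this RNG kind provides (a sketch only, independent of mlr3): the same seed reproduces the same draws, and parallel::nextRNGStream() derives a separate stream that could be handed to another worker. The variable names are made up for this example.

```r
# Remember the current RNG settings so they can be restored at the end.
old_kind <- RNGkind()

# Seed with the L'Ecuyer-CMRG generator and draw a few numbers.
set.seed(42, kind = "L'Ecuyer-CMRG")
kind_now <- RNGkind()[1]
x1 <- runif(3)

# Same seed and kind again: the draws repeat.
set.seed(42, kind = "L'Ecuyer-CMRG")
x2 <- runif(3)
identical(x1, x2)

# Derive a second, independent stream from the first one.
set.seed(42, kind = "L'Ecuyer-CMRG")
stream_1 <- .Random.seed
stream_2 <- parallel::nextRNGStream(stream_1)

# Restore the previous RNG settings.
RNGkind(kind = old_kind[1], normal.kind = old_kind[2], sample.kind = old_kind[3])
```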

> My question is how can I get a reproducible resampling result in mlr3 without using future_lapply or foreach?

{mlr3} uses {future} for all kinds of internal parallelization, so there is no way around {future}. So yes, set future.seed = TRUE and you should be fine.
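For plain {future.apply} code outside of {mlr3}, the future.seed = TRUE pattern looks roughly like the sketch below. The toy function and variable names are only for illustration; the point is that the same set.seed() call gives the same draws whether the plan is parallel or sequential.

```r
library(future)
library(future.apply)

# Toy worker function: one random draw per element.
draw_one <- function(i) rnorm(1)

# Two parallel runs with the same seed
plan(multisession, workers = 2)
set.seed(42)
draw_par_1 <- future_lapply(1:4, draw_one, future.seed = TRUE)
set.seed(42)
draw_par_2 <- future_lapply(1:4, draw_one, future.seed = TRUE)

# One sequential run with the same seed
plan(sequential)
set.seed(42)
draw_seq <- future_lapply(1:4, draw_one, future.seed = TRUE)

# Both comparisons are expected to be TRUE
identical(draw_par_1, draw_par_2)
identical(draw_par_1, draw_seq)
```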

answered Feb 17 '21 at 06:56

pat-s

Some minor corrections for the latest CRAN version: (1) You don't need to select a different RNGkind; sticking to the default is fine, and future automatically takes care of the rest. (2) We set future.seed to TRUE internally. – Michel Feb 17 '21 at 09:13
I tried set.seed(42, "L'Ecuyer-CMRG") and set.seed(42), and they give the same result. I had learned that set.seed() is not recommended when parallelizing with the parallel package, but it seems to be simpler with the future backend than with parallel. – Kim.L Feb 18 '21 at 03:00
