
I want to first do the imputation within each CV fold, then train the learner with an AutoTuner, and test it on the test sets.

I can see that once the resampling scheme is fixed, the imputation is fixed, so only (inner folds) * (outer folds) imputations are needed. However, in mlr3 the imputation is combined with the learner via a pipeline, so the number of imputations becomes (inner folds) * (outer folds) * (autotuning evaluations).
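
For concreteness, a minimal sketch of the setup I mean is below. The concrete choices (the built-in pima task, po("imputehist") standing in for the real imputation method, rpart with a cp search space) are only illustrative.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
library(paradox)

task <- tsk("pima")  # built-in task with missing values

# Imputation is part of the learner graph, so it is re-fit for every
# hyperparameter evaluation inside the inner resampling.
graph_learner <- as_learner(po("imputehist") %>>% lrn("classif.rpart"))

at <- AutoTuner$new(
  learner      = graph_learner,
  resampling   = rsmp("cv", folds = 3),                        # inner folds
  measure      = msr("classif.ce"),
  search_space = ps(classif.rpart.cp = p_dbl(0.001, 0.1)),
  terminator   = trm("evals", n_evals = 20),
  tuner        = tnr("random_search")
)

# Nested resampling: the imputation runs roughly
# (outer folds) * (inner folds) * (tuning evaluations) times.
rr <- resample(task, at, rsmp("cv", folds = 5))                # outer folds
```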

Is there any way to impute along with resampling instead of a learner?


1 Answer


No, that is not possible. You are right that it is unnecessary to impute the missing values again for each hyperparameter configuration. Unfortunately, mlr3 cannot cache the imputed data sets.

be-marc
  • Thanks for your response, be-marc. In that case, using an advanced imputation method such as missForest will cost a lot of extra runtime. Do you know of any workaround, for example specifying pre-imputed data for the nested resampling? – Chris Lotus Dec 21 '21 at 16:41
  • mlr3 only stores row ids to define a resampling scheme, not (imputed) copies of the data set. What you want might become possible in the future with cached pipelines, but this will not be implemented in the coming months. Sorry! – be-marc Dec 26 '21 at 21:18
  • Thanks! After some trials, I set a seed, recorded the CV splits, and wrote an imputation pipeline that replaces the unimputed task with the imputed data. – Chris Lotus Dec 27 '21 at 23:06
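
For what it's worth, a rough sketch of the kind of per-fold pre-imputation workaround mentioned in the last comment might look like the following. The helper `impute_fold()` and all concrete choices (pima task, po("imputehist") standing in for missForest, rpart with a cp search space) are assumptions for illustration, not the poster's actual code, and for simplicity it only pre-imputes per outer fold; the inner splits could be recorded and handled the same way.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
library(paradox)

set.seed(1)
task  <- tsk("pima")
outer <- rsmp("cv", folds = 5)
outer$instantiate(task)            # fixes the outer splits

# Fit the imputation on the outer-training rows only, then apply the fitted
# imputation to both partitions. po("imputehist") is a cheap stand-in for a
# heavier method such as missForest.
impute_fold <- function(task, train_ids, test_ids) {
  op <- po("imputehist")
  train_task <- op$train(list(task$clone()$filter(train_ids)))[[1]]
  test_task  <- op$predict(list(task$clone()$filter(test_ids)))[[1]]
  list(train = train_task, test = test_task)
}

scores <- sapply(seq_len(outer$iters), function(i) {
  folds <- impute_fold(task, outer$train_set(i), outer$test_set(i))

  at <- AutoTuner$new(
    learner      = lrn("classif.rpart"),
    resampling   = rsmp("cv", folds = 3),   # inner folds on pre-imputed data
    measure      = msr("classif.ce"),
    search_space = ps(cp = p_dbl(0.001, 0.1)),
    terminator   = trm("evals", n_evals = 20),
    tuner        = tnr("random_search")
  )

  at$train(folds$train)
  at$predict(folds$test)$score(msr("classif.ce"))
})

mean(scores)
```

Note the trade-off: imputing the whole outer-training set once means all inner folds share one imputation model, which accepts a little leakage between inner folds in exchange for reducing the number of imputations from (outer folds) * (inner folds) * (tuning evaluations) to roughly (outer folds).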