Imputation of target using mlr3

Question

After studying the mlr3 documentation and working through the given examples, I still couldn't find out how to impute the target variable of a regression task when it has missing values. I want to use Ranger, but it can't deal with missing values in the target variable.

Error: Task 'Airtemp' has missing values in column(s) 'T.means.hr', but learner 'regr.ranger' does not support this
This happened PipeOp regr.ranger's $train()

task_Airtemp$missings()

Output:

  T.means.hr   H.means.hr Rad.means.hr    timestamp
         266          213          739            0

Thanks to the tutorials and the mlr3book, I was quickly able to include missing indicators and imputation in my workflow as PipeOps, but only for the features.

pom = po("missind") # Add missing indicator columns ("dummy columns") to the Task
pon = po("imputehist", id = "imputer_num") # Imputes numerical features by histogram

For example, you can see that the target variable is unaffected by the PipeOp pom:

task_ext$data()

         T.means.hr   missing_H.means.hr missing_Rad.means.hr missing_timestamp
   1:        23.61     present                  present               present

My first idea was to define a task without declaring it as a regression task (as_task() instead of as_task_regr()) and to set the target variable only at the end of the workflow, for the learner, but that didn't work:

Error in UseMethod("as_task") : 
  no applicable method for 'as_task' applied to an object of class "data.frame"

The idea of changing the role of the target to a feature with:

task_Airtemp$col_roles$feature = "T.means.hr"

and setting it back to target after the PipeOps pom and pon are done didn't work either.

For the resampling step I want to use RollingWindowCV from the mlr3temporal package. That's why it is important to me to have a time series without missing values.

rr = resample(task_Airtemp, graph_learner, rsmp("RollingWindowCV", folds = 10, fixed_window = TRUE, window_size = window.size, horizon = predict.horizon))

Sorry if I have overlooked something, and thanks for the amazing package. :)

Your best bet is to try some sort of semi-supervised learning algorithm, where unlabeled examples are used for training. As @pat-s answered, imputing the target variable using common imputation methods is a very bad idea. (missuse, Aug 25 '22 at 18:08)

score 4 · Accepted Answer · answered Aug 25 '22 at 08:27

Usually you'll want to impute features but not the target, because imputing the target might introduce (substantial) bias into your model. The same concern applies to feature imputation to some degree, but it weighs more heavily for the target variable.

See also https://datascience.stackexchange.com/questions/26581/should-i-impute-target-values and possibly other discussions on this topic.

I don't think {mlr3pipelines} is capable of imputing the target variable (but I am not 100% sure about this). Of course, you are free to impute the target variable outside of {mlr3} and then create a task from the result, but I would not recommend it.
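
To make "outside of {mlr3}" concrete, here is a minimal sketch of two ways to handle the target before the task is created. The data frame is a made-up stand-in for the real air temperature data, and linear interpolation via base R's approx() is only an assumed example of an imputation method, not something {mlr3} provides or recommends. Option A (dropping rows with a missing target) avoids imputation entirely; Option B is the discouraged route described above.

```r
library(mlr3)

# Hypothetical toy data; column names mirror the question, values are invented
df = data.frame(
  timestamp  = 1:6,
  H.means.hr = c(50, 52, 54, 55, 53, 51),
  T.means.hr = c(20.1, NA, 21.3, NA, 22.0, 21.5)
)

# Option A: drop rows whose target is missing, then create the task
df_complete = df[!is.na(df$T.means.hr), ]
task_drop = as_task_regr(df_complete, target = "T.means.hr", id = "Airtemp")

# Option B (not recommended, may bias the model): fill the target by
# linear interpolation over the timestamp before creating the task.
# approx() ignores the NA pairs and interpolates at every timestamp.
df_interp = df
df_interp$T.means.hr = approx(
  x = df$timestamp, y = df$T.means.hr, xout = df$timestamp
)$y
task_interp = as_task_regr(df_interp, target = "T.means.hr", id = "Airtemp")

# Neither task reports missing values in the target any more
task_drop$missings()
task_interp$missings()
```

Note that Option A leaves gaps in the time series, which matters for a rolling-window resampling, while Option B keeps the series regular at the cost of training on invented target values.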
