I followed the mlr3 documentation on imputing data with pipelines. However, the model that I trained does not allow predictions when one column is NA.

Do you have any idea why it doesn't work?

train step

library(mlr3)
library(mlr3learners)
library(mlr3pipelines)


data("mtcars", package = "datasets")
data = mtcars[, 1:3]
str(data)
task_mtcars = TaskRegr$new(id="cars", backend = data, target = "mpg")


imp_missind = po("missind")
imp_num     = po("imputehist", param_vals =list(affect_columns = selector_type("numeric")))
scale = po("scale")
learner = lrn('regr.ranger')

graph = po("copy", 2) %>>% 
  gunion(list(imp_num %>>% scale,imp_missind)) %>>%
  po("featureunion") %>>%
  po(learner)
graph$plot()

graphlearner = GraphLearner$new(graph)
# train on the complete data before predicting
graphlearner$train(task_mtcars)

predict step

data = task_mtcars$data()[12:12,]
data[1:1, cyl:=NA]
predict(graphlearner, data)

The error is

Error: Missing data in columns: cyl.
ZchGarinch

1 Answer

The example in the mlr3gallery seems to work for your case, so you basically have to switch the order of imputehist and missind.

Another approach would be to set missind's `which` hyperparameter to "all", which forces the creation of an indicator for every column, even if a column had no missing values during training.

This is actually a bug, where missind returns the full task if trained on data with no missings (which in turn then overwrites the imputed values). Thanks a lot for spotting it. I am trying to fix it here PR
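As a minimal sketch of the second workaround, assuming a current mlr3pipelines version and that the ranger package is installed, the graph from the question can be rebuilt with `which = "all"` on missind. The variable names (`task`, `gl`, `newdata`) are illustrative and not taken from the question.

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# Same data as in the question: mpg as target, cyl and disp as features.
task = TaskRegr$new(id = "cars", backend = mtcars[, 1:3], target = "mpg")

graph = po("copy", 2) %>>%
  gunion(list(
    # branch 1: impute numeric features, then scale them
    po("imputehist") %>>% po("scale"),
    # branch 2: create a missing indicator for every column,
    # even though the training data contains no missing values
    po("missind", param_vals = list(which = "all"))
  )) %>>%
  po("featureunion") %>>%
  po("learner", lrn("regr.ranger"))

gl = GraphLearner$new(graph)
gl$train(task)

# One row of new data with cyl set to NA
newdata = mtcars[12, c("cyl", "disp")]
newdata$cyl = NA_real_

pred = predict(gl, newdata)
print(pred)
```

With this setting the indicator branch only returns the indicator columns, so the imputed values from the first branch are what reach the learner.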

pfistfl
  • Ok, I'll try your solution. However, I have a question about processing only a few variables. Let's say I have a dataset with 2 categorical variables and 3 numeric variables. I would like to preprocess the categorical and numeric variables separately, in parallel. Is there a `po` that allows me to select a few variables for a specific processing step? – ZchGarinch Mar 04 '20 at 14:37
  • Yes, have a look at the `affect_columns` hyperparameter in `PipeOpImpute` and use an appropriate `Selector`. This hyperparameter is supported by many `PipeOps`. The following for example selects only a single column: `po("imputehist", param_vals = list(affect_columns = selector_name("Sepal.Length")))` – pfistfl Mar 09 '20 at 07:52
  • How can I drop the original factor columns after encoding them with `po('encode')`? – ZchGarinch Mar 10 '20 at 13:53
  • `po("select")` allows for selecting / de-selecting columns. But `po('encode')` should drop the original factor columns by itself. – pfistfl Mar 18 '20 at 07:55
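
The `affect_columns` approach from the comments can be sketched as follows. This is only an illustration under assumptions: the toy data frame and its column names are hypothetical, and the steps are chained sequentially rather than in parallel branches, with each PipeOp restricted to its own column type through a `Selector`.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical toy data: 3 numeric and 2 factor features plus a numeric target.
set.seed(1)
n = 30
toy = data.frame(
  y  = rnorm(n),
  n1 = rnorm(n), n2 = rnorm(n), n3 = rnorm(n),
  f1 = factor(sample(c("a", "b"), n, replace = TRUE)),
  f2 = factor(sample(c("u", "v", "w"), n, replace = TRUE))
)
toy$n1[1:3] = NA  # missing values in a numeric column
toy$f1[4:5] = NA  # missing values in a factor column
task = TaskRegr$new(id = "toy", backend = toy, target = "y")

graph =
  # numeric columns only: impute from a histogram, then scale
  po("imputehist", param_vals = list(affect_columns = selector_type("numeric"))) %>>%
  po("scale", param_vals = list(affect_columns = selector_type("numeric"))) %>>%
  # factor columns only: impute with the most frequent level
  po("imputemode", param_vals = list(affect_columns = selector_type("factor"))) %>>%
  # one-hot encode the factors; the original factor columns are dropped
  po("encode")

out = graph$train(list(task))[[1]]
print(out$feature_names)
```

After training, `out` should contain the three numeric columns plus the encoded columns, and no factor columns, so an extra `po("select")` step is not needed here.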