How to subset task according to indicator column and batch train-predict in mlr3?

Question

Background

I'm modeling and predicting with the mlr3 package in R. I'm working with one big data set that consists of test and train sets. The two sets are marked by an indicator column (in code: test_or_train).

Goal

Batch train all learners on the rows marked 'train' in the test_or_train column of the data set.
Batch predict the rows marked 'test' in the test_or_train column with the respective trained learner.

Code

Placeholder data set with a test-train indicator column. (In the actual data the train-test split is not artificial.)
Two tasks. (In the actual code the tasks are distinct and there are more of them.)

library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
library(caret)

# Data
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]

## Create artificial partition to test and train sets
art_part = createDataPartition(data$imdb_rating, list=FALSE)
train = data[art_part,]
test = data[-art_part,]

## Add test-train indicators
train$test_or_train = 'train'
test$test_or_train = 'test'

## Data set that I want to work / am working with
data = rbind(test, train)

# Create two tasks (Here the tasks are the same but in my data set they differ.)
task1 = 
  TaskRegr$new(
    id = 'office1', 
    backend = data, 
    target = 'imdb_rating'
  )
task2 = 
  TaskRegr$new(
    id = 'office2', 
    backend = data, 
    target = 'imdb_rating'
  )


# Model specification 
graph = 
  po('scale') %>>% 
  lrn('regr.cv_glmnet', 
      id = 'rp', 
      alpha = 1, 
      family = 'gaussian'
  ) 

# Learner creation
learner = GraphLearner$new(graph)

# Goal 
## 1. Batch train all learners with the train rows indicated by the test_or_train column in the data set
## 2. Batch predict the rows designated by the 'test' in the test_or_train column with the respective trained learner

Created on 2020-06-22 by the reprex package (v0.3.0)

Note

I tried using benchmark_grid with row_ids to train the learner on the train rows only, but this did not work. It was also not possible to use the indicator column, which is much easier to work with than row indices: with the test-train indicator column a single rule defines the split, whereas row indices only work as long as all tasks contain the same rows.

benchmark_grid(
    tasks = list(task1, task2), 
    learners = learner, 
    row_ids = train_rows # row_ids is not an argument of benchmark_grid(), and indices are awkward to work with
)

score 6 · Accepted Answer · answered Jun 22 '20 at 08:00

You can use benchmark with a custom design.

The following should do the job (note that I instantiate a custom Resampling for each Task separately).

library(data.table)
design = data.table(
  task = list(task1, task2),
  learner = list(learner)
)

library(mlr3misc)
design$resampling = map(design$task, function(x) {
  # get train/test split
  split = x$data()[["test_or_train"]]
  # remove train-test split column from the task
  x$select(setdiff(x$feature_names, "test_or_train"))
  # instantiate a custom resampling with the given split
  rsmp("custom")$instantiate(x,
    train_sets = list(which(split == "train")),
    test_sets = list(which(split == "test"))
  )
})

benchmark(design)
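
To read the results back out, the value returned by benchmark() can be stored and queried. Below is a minimal, self-contained sketch of the same design followed by that step. It makes three assumptions that are not in the question: the built-in mtcars data stands in for the office data, the test_or_train column is assigned at random, and regr.rpart replaces the glmnet pipeline so that no extra packages are needed.

```r
library(mlr3)
library(mlr3misc)
library(data.table)

# Stand-in data (assumption): mtcars plus a random indicator column
set.seed(1)
dat = mtcars
dat$test_or_train = sample(rep(c("train", "test"), times = c(24, 8)))

task1 = TaskRegr$new(id = "cars1", backend = dat, target = "mpg")
task2 = TaskRegr$new(id = "cars2", backend = dat, target = "mpg")

design = data.table(
  task = list(task1, task2),
  learner = list(lrn("regr.rpart"))
)

design$resampling = map(design$task, function(x) {
  split = x$data()[["test_or_train"]]
  # drop the indicator so it is not used as a feature (modifies the task in place)
  x$select(setdiff(x$feature_names, "test_or_train"))
  # index row ids rather than positions, in case the ids are not 1..n
  rsmp("custom")$instantiate(x,
    train_sets = list(x$row_ids[split == "train"]),
    test_sets = list(x$row_ids[split == "test"])
  )
})

bmr = benchmark(design)

# One row per task: RMSE on the rows marked "test"
scores = bmr$aggregate(msr("regr.rmse"))
print(scores)

# Test-set predictions of the learner trained on the first task
pred = bmr$resample_result(1)$prediction()
print(as.data.table(pred))
```

Because each task gets its own instantiated resampling, the same rule (split == "train" or "test") still works when the tasks contain different rows.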

Could you specify what you mean by batch-processing more clearly or does this answer your question?
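
If batch processing only means running the same train and predict step for every task, the resampling bookkeeping can also be skipped. The sketch below loops over the tasks and uses the row_ids argument of $train() and $predict(). As an illustration only, it again assumes mtcars with a random test_or_train column and regr.rpart instead of the question's data and glmnet pipeline.

```r
library(mlr3)

# Stand-in data (assumption): mtcars plus a random indicator column
set.seed(1)
dat = mtcars
dat$test_or_train = sample(rep(c("train", "test"), times = c(24, 8)))

tasks = list(
  TaskRegr$new(id = "cars1", backend = dat, target = "mpg"),
  TaskRegr$new(id = "cars2", backend = dat, target = "mpg")
)
learner = lrn("regr.rpart")

predictions = lapply(tasks, function(task) {
  split = task$data()[["test_or_train"]]
  # drop the indicator so it is not used as a feature
  task$select(setdiff(task$feature_names, "test_or_train"))
  # clone so that every task gets its own trained learner
  fit = learner$clone(deep = TRUE)
  fit$train(task, row_ids = task$row_ids[split == "train"])
  fit$predict(task, row_ids = task$row_ids[split == "test"])
})
names(predictions) = vapply(tasks, function(task) task$id, character(1))

# Score the held-out rows of one task
print(predictions$cars1$score(msr("regr.rmse")))
```

The benchmark() route keeps models, predictions and scores in one object, so it is usually the more convenient choice once there are many tasks and learners.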
