Background
I'm modeling and predicting with the mlr3 package in R. I'm working with one big data set that contains both the train and the test rows. An indicator column (test_or_train in the code) marks which set each row belongs to.
Goal
- Batch train all learners on the rows marked 'train' in the test_or_train column of the data set.
- Batch predict the rows marked 'test' in the test_or_train column with the respective trained learner.
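To make the two goals concrete, here is a rough sketch for a single task on a toy data frame rather than the real data. The columns x1, x2 and y and the learner regr.featureless are stand-ins chosen only for illustration. It also assumes that a task built from a plain data frame numbers its rows 1..n, so that which() on the indicator column yields valid row ids.

```r
library(mlr3)

# Toy stand-in for the real data; x1, x2 and y are made-up columns.
set.seed(1)
toy = data.frame(
  x1 = rnorm(40),
  x2 = rnorm(40),
  test_or_train = rep(c("train", "test"), times = c(30, 10)),
  stringsAsFactors = FALSE
)
toy$y = 2 * toy$x1 - toy$x2 + rnorm(40, sd = 0.1)

# Row ids per split, derived from the indicator column with one rule.
# Assumption: a data.frame backend uses row ids 1..nrow(toy).
train_ids = which(toy$test_or_train == "train")
test_ids = which(toy$test_or_train == "test")

task = TaskRegr$new(id = "toy", backend = toy, target = "y")
# The indicator is not a predictor, so drop it from the feature set.
task$select(setdiff(task$feature_names, "test_or_train"))

# Stand-in learner; any regression learner could take its place.
learner = lrn("regr.featureless")
learner$train(task, row_ids = train_ids)
prediction = learner$predict(task, row_ids = test_ids)
```

This covers one task and one learner only; what I'm after is the same thing done in batch over several tasks and learners.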
Code
- Placeholder data set with a test-train indicator column. (In the actual data the train-test split is not artificial.)
- Two tasks (in the actual code the tasks are distinct and there are more of them).
library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
library(caret)
# Data
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]
## Create artificial partition to test and train sets
art_part = createDataPartition(data$imdb_rating, list=FALSE)
train = data[art_part,]
test = data[-art_part,]
## Add test-train indicators
train$test_or_train = 'train'
test$test_or_train = 'test'
## Data set that I want to work / am working with
data = rbind(test, train)
# Create two tasks (Here the tasks are the same but in my data set they differ.)
task1 =
  TaskRegr$new(
    id = 'office1',
    backend = data,
    target = 'imdb_rating'
  )
task2 =
  TaskRegr$new(
    id = 'office2',
    backend = data,
    target = 'imdb_rating'
  )
# Model specification
graph =
  po('scale') %>>%
  lrn('regr.cv_glmnet',
    id = 'rp',
    alpha = 1,
    family = 'gaussian'
  )
# Learner creation
learner = GraphLearner$new(graph)
# Goal
## 1. Batch train all learners on the rows marked 'train' in the test_or_train column of the data set
## 2. Batch predict the rows marked 'test' in the test_or_train column with the respective trained learner
Created on 2020-06-22 by the reprex package (v0.3.0)
Note
I tried using benchmark_grid with row_ids to train the learners on the train rows only, but this did not work because benchmark_grid has no such argument. I also could not find a way to use the indicator column directly, which would be much easier than working with row indices. With the test-train indicator column a single rule defines the split, whereas row indices only work as long as all tasks contain the same rows.
benchmark_grid(
  tasks = list(task1, task2),
  learners = learner,
  row_ids = train_rows # Not an argument and not favorable to work with indices
)
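One direction that might work, sketched here on toy data and under assumptions rather than as a confirmed solution: give each task its own fixed split through rsmp("custom"), with the row ids derived from the indicator column, and pass a hand-built design table to benchmark() instead of going through benchmark_grid(). The helpers make_task and make_split, the toy columns x1, x2 and y, and the learner regr.featureless are all hypothetical stand-ins. The sketch again assumes that row ids of a data frame backend run from 1 to nrow(data).

```r
library(mlr3)
library(data.table)

# Toy stand-in for the real data; x1, x2 and y are made-up columns.
set.seed(1)
toy = data.frame(
  x1 = rnorm(40),
  x2 = rnorm(40),
  test_or_train = rep(c("train", "test"), times = c(30, 10)),
  stringsAsFactors = FALSE
)
toy$y = 2 * toy$x1 - toy$x2 + rnorm(40, sd = 0.1)

# Hypothetical helper: build a task and drop the indicator from the features.
make_task = function(id, data) {
  task = TaskRegr$new(id = id, backend = data, target = "y")
  task$select(setdiff(task$feature_names, "test_or_train"))
  task
}

# Hypothetical helper: one fixed train/test split taken from the indicator.
# Assumption: row ids of a data.frame backend are 1..nrow(data).
make_split = function(task, data) {
  split = rsmp("custom")
  split$instantiate(
    task,
    train_sets = list(which(data$test_or_train == "train")),
    test_sets = list(which(data$test_or_train == "test"))
  )
  split
}

task1 = make_task("toy1", toy)
task2 = make_task("toy2", toy)

# Hand-built design: one row per task/learner/resampling combination,
# so every task is paired with the split computed from its own data.
design = data.table(
  task = list(task1, task2),
  learner = list(lrn("regr.featureless"), lrn("regr.featureless")),
  resampling = list(make_split(task1, toy), make_split(task2, toy))
)

bmr = benchmark(design)
scores = bmr$aggregate(msr("regr.mse"))

# Predictions on the 'test' rows of the first task.
test_prediction = bmr$resample_result(1)$prediction()
```

Because each split is computed from the task's own data, the same rule (test_or_train == "train") should keep working when the tasks contain different rows, which is the part that plain row indices could not give me.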