
I want to train models on different subsets of data using mlr3, and I was wondering if there is a way to do this within a pipeline.

What I want to do is similar to the example from R for Data Science - Chapter 25: Many models. Say we use the same data set, gapminder, which contains variables such as GDP and life expectancy for countries around the world. If I wanted to train a model of life expectancy for each country, is there an easy way to create such a pipeline using mlr3?

Ideally, I want to use mlr3pipelines to create a branch in the graph for each subset (e.g. a separate branch for each country), with a model at the end of each branch. The final graph would therefore start at a single node and end in n trained learners, one for each group (i.e. country) in the data set, or in a final node that aggregates the results. I would also expect it to work for new data: for example, if we obtain new data for 2020, I would want it to create predictions for each country using the model trained for that specific country.

All the mlr3 examples I have found either model the entire data set or train a single model on all groups in the training set together.

Currently, I am just manually creating a separate task for each group of data, but it would be nice to have the data subsetting step incorporated into the modelling pipeline.
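
For reference, my current manual approach looks roughly like this (a sketch only, using an rpart learner as a stand-in; `tasks` and `models` are just illustrative names):

df <- gapminder::gapminder

# one mlr3 task per country, built by hand
tasks <- lapply(split(df, df$country), function(d) {
  mlr3::TaskRegr$new(
    id = as.character(d$country[1]),
    backend = d[, c("year", "lifeExp", "pop", "gdpPercap")],
    target = "lifeExp"
  )
})

# one trained learner per country
models <- lapply(tasks, function(t) mlr3::lrn("regr.rpart")$train(t))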

ialm
  • To put it into ML terms: you have a big data set with one factor column (country in your example), and for each level of that column you want to train an individual model? Be aware that this ensemble will fail for new levels, and the models won't "learn" from other countries' data. – jakob-r Sep 30 '20 at 09:10
  • @jakob-r Yes, that is correct. Ideally, I guess I would have a separate pipeline for factors that were not present in the training data. Or if I really wanted to have a forecast, maybe I could apply some clustering to estimate the best matching group(s) to temporarily impute the factor for forecast purposes until there is enough data to train models for the new factor. – ialm Sep 30 '20 at 20:29

1 Answer


You will need functions from two packages: dplyr and tidyr. The following code shows how to train a separate model for each country:

library(dplyr)
library(tidyr)

df <- gapminder::gapminder

by_country <- 
  df %>% 
  # collapse each country's rows into a single list-column entry
  nest(data = -c(continent, country)) %>% 
  # fit one model per country (`learn` is defined below)
  mutate(model = lapply(data, learn))

Note that `learn` is a function that takes a single data frame as its input; I will show how to define it below. The data frame returned by this pipeline looks like this:

# A tibble: 142 x 4
   country     continent data              model     
   <fct>       <fct>     <list>            <list>    
 1 Afghanistan Asia      <tibble [12 x 4]> <LrnrRgrR>
 2 Albania     Europe    <tibble [12 x 4]> <LrnrRgrR>
 3 Algeria     Africa    <tibble [12 x 4]> <LrnrRgrR>
 4 Angola      Africa    <tibble [12 x 4]> <LrnrRgrR>
 5 Argentina   Americas  <tibble [12 x 4]> <LrnrRgrR>
 6 Australia   Oceania   <tibble [12 x 4]> <LrnrRgrR>
 7 Austria     Europe    <tibble [12 x 4]> <LrnrRgrR>
 8 Bahrain     Asia      <tibble [12 x 4]> <LrnrRgrR>
 9 Bangladesh  Asia      <tibble [12 x 4]> <LrnrRgrR>
10 Belgium     Europe    <tibble [12 x 4]> <LrnrRgrR>

To define the `learn` function, I follow the steps provided on the mlr3 website. The function is:

learn <- function(df) {
  # create a regression task: the target `lifeExp` is a numeric variable
  task <- mlr3::TaskRegr$new(id = "gapminder", backend = df, target = "lifeExp")
  # define the learner you want to use
  learner <- mlr3::lrn("regr.rpart")
  # train on the subset; $train() returns the trained learner,
  # which is also the return value of this function
  learner$train(task)
}
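
Once `by_country` exists, you can pull out the model for any country and score new observations with mlr3's `$predict_newdata()` method. A quick sketch (the country name and the `new_data` values are purely illustrative):

# trained model for one country
canada_model <- by_country$model[[which(by_country$country == "Canada")]]

# hypothetical new observations with the same feature columns as the training data
new_data <- data.frame(year = 2020, pop = 38e6, gdpPercap = 43000)

canada_model$predict_newdata(new_data)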

I hope this solves your problem.

New

Consider the following steps to train a model and generate predictions for each country.

create_task <- function(id, df, ratio) {
  # hold out a random (1 - ratio) share of each country's rows for testing
  train <- sample(nrow(df), ratio * nrow(df))
  task <- mlr3::TaskRegr$new(id = as.character(id), backend = df, target = "lifeExp")
  list(task = task, train = train, test = seq_len(nrow(df))[-train])
}

model_task <- function(learner, task_list) {
  # train only on the training rows
  learner$train(task_list[["task"]], row_ids = task_list[["train"]])
}

predict_result <- function(learner, task_list) {
  # predict on the held-out rows
  learner$predict(task_list[["task"]], row_ids = task_list[["test"]])
}

by_country <- 
  df %>% 
  nest(data = -c(continent, country)) %>% 
  mutate(
    task_list = Map(create_task, country, data, 0.8), 
    # create one learner per country: mlr3 learners are R6 (reference) objects,
    # so recycling a single learner would leave every row pointing to the same
    # object, which each subsequent $train() call would overwrite
    learner = lapply(task_list, function(x) mlr3::lrn("regr.rpart"))
  ) %>% 
  within({
    # train each country's learner in place (R6 side effect) ...
    Map(model_task, learner, task_list)
    # ... then predict on each country's held-out rows
    prediction <- Map(predict_result, learner, task_list)
  })
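
Each element of `by_country$prediction` is now an ordinary mlr3 `PredictionRegr` object, so you can inspect or score it per country. For example (RMSE here is just one possible measure):

# predictions for the first country
by_country$prediction[[1]]

# held-out RMSE per country
sapply(by_country$prediction, function(p) p$score(mlr3::msr("regr.rmse")))
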
ekoam
  • Thanks, this is similar to what I am currently doing, except with `data.table`. How would one go about producing forecasts with the trained models in this example? – ialm Sep 30 '20 at 20:26
  • You need some additional setup. See my new post above. @ialm – ekoam Oct 01 '20 at 02:54
  • Thanks, I guess that the current pipelines do not enable my desired workflow. I ended up doing something similar, except with `data.table`s instead of `tibbles` since my data was already being processed using `data.table`. – ialm Oct 02 '20 at 21:45