I want to train models on different subsets of data using `mlr3`, and I was wondering whether there is a way to do this within a pipeline.
What I want to do is similar to the example from R for Data Science - Chapter 25: Many models. Say we use the same data set, `gapminder`, which contains variables such as GDP and life expectancy for countries around the world. If I wanted to train a life-expectancy model for each country, is there an easy way to create such a pipeline using `mlr3`?
Ideally, I want to use `mlr3pipelines` to create a branch in the graph for each subset (e.g. a separate branch for each country) with a model at the end. The final graph would therefore start at a single node and have `n` trained learners at the end nodes, one for each group (i.e. country) in the data set, or a final node that aggregates the results. I would also expect it to work for new data. For example, if we obtain new data in the future for 2020, I would want it to create predictions for each country using the model trained for that specific country.
All the `mlr3` examples I have found seem to deal with models for the entire data set, or have models trained with all the groups in the training set.
Currently, I am just manually creating a separate task for each group of data, but it would be nice to have the data subsetting step incorporated into the modelling pipeline.
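
For reference, a minimal sketch of that manual approach, assuming the `gapminder` package is installed and using `regr.rpart` purely as a placeholder learner. The feature columns and the helper names `cols`, `by_country` and `models` are illustrative choices, not part of any `mlr3` API:

```r
library(mlr3)
library(gapminder)

# Columns used for modelling (an assumption made for this sketch)
cols <- c("year", "pop", "gdpPercap", "lifeExp")

# One data frame per country
by_country <- split(as.data.frame(gapminder), gapminder$country, drop = TRUE)

# One task and one trained learner per country
models <- lapply(names(by_country), function(cn) {
  d <- by_country[[cn]][, cols]
  task <- as_task_regr(d, target = "lifeExp", id = make.names(cn))
  learner <- lrn("regr.rpart")  # placeholder learner
  learner$train(task)
  learner
})
names(models) <- names(by_country)

# Predict for one country with the model trained on that country.
# The last observed row stands in for future data here.
new_rows <- tail(by_country[["Sweden"]][, cols], 1)
pred <- models[["Sweden"]]$predict_newdata(new_rows)
```

This works, but the split and the lookup of the right model at prediction time both live outside the pipeline, which is the part I would like the graph to handle.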