
This is a follow-up to a previous question about learning multiple models.

The use case is that I have multiple observations for each subject, and I want to train a model for each subject. See Hadley's excellent presentation on how to do this.

In short, this is possible to do using dplyr and purrr like so:

library(purrr)
library(dplyr)
library(fitdistrplus)
dt %>% 
    split(dt$subject_id) %>%
    map( ~ fitdist(.$observation, "norm")) 

Since the model building is an embarrassingly parallel task, I was wondering whether dplyr or purrr offers an easy-to-use parallelization mechanism for such tasks (like a parallel map).

If these libraries don't provide easy parallelization, could it be done using the classic R parallelization packages (parallel, foreach, etc.)?
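For reference, the split-and-fit pattern from the question can be parallelized with the base parallel package alone. A minimal sketch, assuming `dt` is a data frame with `subject_id` and `observation` columns as above (the simulated `dt` here is a stand-in):

```r
library(parallel)
library(fitdistrplus)

# Simulated stand-in for dt: 4 subjects, 50 observations each
dt <- data.frame(
  subject_id  = rep(1:4, each = 50),
  observation = rnorm(200, mean = 10, sd = 2)
)

# One worker per core, leaving one core free
cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, library(fitdistrplus))  # load the package on each worker

# Split by subject and fit one normal distribution per subject in parallel
fits <- parLapply(cl, split(dt, dt$subject_id),
                  function(d) fitdist(d$observation, "norm"))

stopCluster(cl)
```

On Unix-alikes, `mclapply(split(dt, dt$subject_id), ...)` achieves the same with forked workers and no explicit cluster setup.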

Axeman
Bar

2 Answers


Just adding an answer for completeness: you will need to install multidplyr from Hadley's repo to run this; more info is in the vignette:

library(dplyr)
library(multidplyr)
library(purrr)

cluster <- create_cluster(4)
set_default_cluster(cluster)
cluster_library(cluster, "fitdistrplus")

# dt is a dataframe, subject_id identifies observations from each subject
by_subject <- partition(dt, subject_id)

fits <- by_subject %>% 
    do(fit = fitdist(.$observation, "norm"))

collected_fits <- collect(fits)$fit
collected_summaries <- collected_fits %>% map(summary)
Bar
    multidplyr hasn't been developed for 2 years (as of Aug 2018) so something like furrr might be better. – xiaodai Aug 15 '18 at 04:43

There is the furrr package now; for example, something like:

library(dplyr)
library(furrr)
library(fitdistrplus)
plan(multisession)   # or perhaps:  plan(multicore), see ?plan

dt %>% 
    split(dt$subject_id) %>%
    future_map(~fitdist(.$observation, "norm"))
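If you want explicit control over the number of workers, `plan()` also accepts a `workers` argument, and the fitted objects can be summarized afterwards with a plain `map`. A sketch assuming the same shape of `dt` as in the question (the simulated `dt` here is a stand-in):

```r
library(dplyr)
library(furrr)
library(fitdistrplus)
library(purrr)

# Simulated stand-in for dt: 4 subjects, 50 observations each
dt <- data.frame(
  subject_id  = rep(1:4, each = 50),
  observation = rnorm(200, mean = 10, sd = 2)
)

plan(multisession, workers = 4)  # spin up 4 background R sessions

# Fit one normal distribution per subject, in parallel
fits <- dt %>%
  split(dt$subject_id) %>%
  future_map(~ fitdist(.$observation, "norm"))

summaries <- map(fits, summary)

plan(sequential)  # shut the workers down when done
```

The future framework that furrr builds on detects which packages the mapped function needs and loads them on the workers, so no explicit cluster setup is required.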
Axeman