
I'm trying to figure out how to run the dplyr::do function in parallel. After reading some of the docs, it seems that dplyr::init_cluster() should be sufficient for telling do() to run in parallel. Unfortunately this doesn't seem to be the case when I test it:

library(dplyr)
test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])  # b has two groups: "a" (rows 1 and 3) and "b" (row 2)

init_cluster()
system.time({
  test %>%
    group_by(b) %>%
    do({
      Sys.sleep(3)  # each of the two groups sleeps 3 s
      data_frame(c = rep(max(.$a), times = max(.$a)))
    })
})
stop_cluster()

Gives this output:

Initialising 2 core cluster.
|==========================================================================|100% ~0 s remaining
   user  system elapsed 
   0.03    0.00    6.03 

I would expect the elapsed time to be about 3 seconds if the do() call were split between the two cores. I can also confirm that everything runs in the main process by adding a print() inside the do(): the output appears in the main R terminal. What am I missing here?
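For example (an illustrative version of that print() check; not part of the timing code above):

test %>%
  group_by(b) %>%
  do({
    print(Sys.getpid())  # same pid for every group, so everything
                         # runs in the master process
    data_frame(c = max(.$a))
  })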

I'm using dplyr 0.4.2 with R 3.2.1.

Max Gordon
  • I've found that for really critical code, the best way, at least for my use cases, is to get your hands dirty with Rcpp and OpenMP. It's mostly beyond my computer science ability, but there seem to be so many subtle cache interactions, and sometimes processor or compiler quirks, that you need to just profile and benchmark carefully. I also found structuring the data well often made the biggest difference, and could help parallelization significantly. Good luck! – Jack Wasey Oct 02 '15 at 15:02
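For reference, a minimal sketch of the Rcpp + OpenMP approach mentioned in the comment above (par_sum is an invented example; assumes a compiler with OpenMP support):

library(Rcpp)

# Compile a C++ function with OpenMP enabled via Rcpp's "openmp" plugin
cppFunction(plugins = "openmp", code = '
  double par_sum(NumericVector x) {
    double total = 0.0;
    int n = x.size();
    // Split the loop across threads; each thread keeps a private
    // partial sum that the reduction clause combines at the end
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) total += x[i];
    return total;
  }
')

par_sum(as.numeric(1:1e6))  # 500000500000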

3 Answers


As mentioned by @Maciej, you could try multidplyr:

## Install from github
devtools::install_github("hadley/multidplyr")

Use partition() to split your dataset across multiple cores:

library(dplyr)
library(multidplyr)
test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])
test1 <- partition(test, a)  # one shard per value of a

This initializes a 3-core cluster (one core for each value of a):

# Initialising 3 core cluster.

Then simply perform your do() call:

test1 %>%
  do({
    dplyr::data_frame(c = rep(max(.$a), times = max(.$a)))
  })

Which gives:

#Source: party_df [6 x 2]
#Groups: a
#Shards: 3 [1--3 rows]
#
#      a     c
#  (int) (int)
#1     1     1
#2     2     2
#3     2     2
#4     3     3
#5     3     3
#6     3     3
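Note that the question grouped by b rather than a; a sketch of the same idea partitioned by b, reusing the Sys.sleep() timing from the question (and namespacing data_frame as above, since the workers may not have dplyr attached):

test %>%
  partition(b) %>%  # one shard per value of b, i.e. a 2-core cluster
  do({
    Sys.sleep(3)
    dplyr::data_frame(c = rep(max(.$a), times = max(.$a)))
  }) %>%
  collect()         # bring the shards back to the master process

Both shards sleep at the same time, so the elapsed time should be about 3 seconds instead of 6.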
Steven Beaupré
  • Thanks! I looked into @Maciej's answer and it's great that this has finally arrived. I often do more complex tasks than the ones available in summarize, and without parallelization I couldn't really find dplyr as useful as many claim it to be. – Max Gordon Nov 14 '15 at 20:13
  • @MaxGordon Glad it helped! – Steven Beaupré Nov 14 '15 at 21:00
  • How do you send a user-defined function that is to be run with `do()` to each node? I'm getting "function not found". – Dominik Dec 10 '15 at 17:05
  • @Dominik Would you mind posting a new question with a reproducible example? I could give it a shot. – Steven Beaupré Dec 10 '15 at 18:03
  • Looks like you can do that the usual way with parallel's clusterExport if you make the cluster manually: `cluster <- create_cluster(4); clusterExport(cluster, c("userfun1", "userfun2", "userfun3"))` – Jan Stanstrup Jan 26 '16 at 21:18
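Expanding on the comment above, a sketch of sending a user-defined function to a manually created cluster (userfun is a placeholder, and the cluster argument to partition() is assumed from the development-version API at the time):

library(dplyr)
library(multidplyr)
library(parallel)

userfun <- function(x) max(x) * 2         # some user-defined helper

test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])

cluster <- create_cluster(2)              # make the cluster manually
clusterExport(cluster, "userfun")         # copy the helper to each worker

test %>%
  partition(b, cluster = cluster) %>%     # shard onto the manual cluster
  do(dplyr::data_frame(c = userfun(.$a))) %>%
  collect()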

You could check out Hadley's new package, multidplyr.

Maciej

According to https://twitter.com/cboettig/status/588068454239830017, this feature does not currently seem to be supported.
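Until it is, one possible workaround is to split the grouped data yourself and run each piece with the base parallel package (a sketch; mclapply() relies on forking, so this works on Unix-alikes only):

library(dplyr)
library(parallel)

test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])

system.time({
  result <- test %>%
    split(.$b) %>%                  # one data frame per group of b
    mclapply(function(d) {
      Sys.sleep(3)
      data_frame(b = d$b[1],
                 c = rep(max(d$a), times = max(d$a)))
    }, mc.cores = 2) %>%
    bind_rows()                     # reassemble the pieces
})
# elapsed is ~3 s: the two groups sleep concurrently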