I have a slow function that I want to apply to each row in a data.frame. The computation is embarrassingly parallel.

I have 4 cores, but R's built-in functions only use one.

All I want is a parallel equivalent of:

data$c = slow.foo(data$a, data$b)

I can't find clear instructions on which library to use (overwhelmed by choice) and how to use it. Any help would be greatly appreciated.

sharoz

1 Answer

The parallel package is included with base R. Here's a quick example using parApply from that package:

library(parallel)

# Some dummy data
d <- data.frame(x1=runif(1000), x2=runif(1000))

# Create a cluster with one fewer worker than there are cores available. Adjust as necessary
cl <- makeCluster(detectCores() - 1)

# Just like regular apply, but rows get sent to the various processes
out <- parApply(cl, d, 1, function(x) x[1] - x[2])

stopCluster(cl)

# Same as x1 - x2?
identical(out, d$x1 - d$x2)

# [1] TRUE

You also have, e.g., parSapply and parLapply at your disposal.

Of course, for the example I've given, the vectorised operation d$x1 - d$x2 is much faster. Think about whether your processes can be vectorised rather than performed row by row.
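As a rough sketch of the parSapply alternative mentioned above, the rows can also be processed by index rather than being handed over as vectors. The dummy data and the subtraction are the same as before; forwarding `d` as an extra argument and using a fixed two workers (instead of `detectCores() - 1`) are choices made for this sketch only.

```r
library(parallel)

# Same dummy data as in the example above
d <- data.frame(x1 = runif(1000), x2 = runif(1000))

# Two workers keep the sketch light; detectCores() - 1 is the usual choice
cl <- makeCluster(2)

# Iterate over row indices; `d` is forwarded to every worker as an extra argument
out <- parSapply(cl, seq_len(nrow(d)), function(i, d) d$x1[i] - d$x2[i], d = d)

stopCluster(cl)

# Compare with the vectorised result
all.equal(out, d$x1 - d$x2)
```

Because the whole of `d` is copied to each worker here, this form only makes sense when the data frame is small relative to the cost of the per-row computation.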

jbaums
  • Thanks for the info. Your example works, but I can't get a named equivalent to work. I changed `x[1]-x[2]` to `x$x1-x$x2`. That yields the following error: `Error in checkForRemoteErrors(val) : 7 nodes produced errors; first error: $ operator is invalid for atomic vectors` – sharoz Jun 22 '14 at 03:42
  • @sharoz: This is because the rows are simplified to vectors (for which `$` subsetting is not an option). The same would occur for the non-parallel `apply(d, 1, function(x) x$x1 - x$x2)`. If you want to use names, you can do: `function(x) x['x1'] - x['x2']`. – jbaums Jun 22 '14 at 03:45
  • Thanks again, but it's hitting problems with named variables. Per the original example, I defined `bar <- function(x) slow.foo(x['a'], x['b'])`. Then I ran `parApply(cl, d, 1, bar)` which can't find the symbol slow.foo (even though it's defined). Any ideas? – sharoz Jun 22 '14 at 04:06
  • @sharoz: Other than the object that you pass to `parApply` (i.e., `d`), you need to send objects/functions to the cluster processes with `clusterExport`. Try: `clusterExport(cl, 'slow.foo')` prior to `parApply`. This character vector should also include names of any objects referred to within the body of `slow.foo` (other than those that are passed from `bar`). Be aware that if any of said objects are huge, you might run into memory problems by sending copies to all processes. – jbaums Jun 22 '14 at 04:16
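
Putting the comments above together, a minimal sketch might look like the following. `slow.foo` here is a hypothetical stand-in (a short sleep plus an addition), since the real function isn't shown in the question; the column names `a` and `b` and the wrapper `bar` follow the thread. The `envir = environment()` argument and the fixed two workers are additions of this sketch: at top level `environment()` is the global environment, which is what `clusterExport` uses by default anyway.

```r
library(parallel)

# Hypothetical stand-in for the asker's slow function
slow.foo <- function(a, b) {
  Sys.sleep(0.001)  # simulate expensive work
  a + b
}

# Each row arrives as a named vector, so index by name with [ ] rather than $
bar <- function(x) slow.foo(x['a'], x['b'])

d <- data.frame(a = runif(100), b = runif(100))

# Two workers keep the sketch light; detectCores() - 1 is the usual choice
cl <- makeCluster(2)

# Workers start with empty workspaces, so send slow.foo to them first
clusterExport(cl, 'slow.foo', envir = environment())

d$c <- parApply(cl, d, 1, bar)

stopCluster(cl)
```

If `slow.foo` itself relied on other objects or helper functions, their names would need to be added to the character vector passed to `clusterExport` as well.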