Divide et impera on a data frame in R

Question

As we all know, R isn't the most efficient platform for running large analyses. Suppose I have a large data frame containing three columns:

GROUP   X  Y
A       1  2
A       2  2
A       2  3
...
B       1  1
B       2  3
B       1  4
...
millions of rows

and I want to run a computation on each group (e.g. compute Pearson's r on X,Y) and store the results in a new data frame. I can do it like this:

df <- loadDataFrameFrom(someFile)
results <- data.frame()
for (g in unique(df$GROUP)) {
    gdf <- subset(df, df$GROUP == g)
    partialRes <- slowStuff(gdf$X, gdf$Y)
    results <- rbind(results, data.frame(GROUP = g, RES = partialRes))
}
# results contains all the results here.
useResults(results)

The obvious problem is that this is VERY slow, even on a powerful multi-core machine.

My question is: is it possible to parallelise this computation, having for example a separate thread for each group or a block of groups? Is there a clean R pattern to solve this simple divide et impera problem?

Thanks, Mulone

How it could be parallelised would very much depend on the type of computation, wouldn't it? — Reuben L., May 04 '12 at 16:10
Did you deliberately try to do everything you could to make this as slow as possible? I'm not sure you could have written this any more inefficiently. — Joshua Ulrich, May 04 '12 at 16:20
+1 to Josh's comment. `rbind` inside a loop comes up so often, perhaps R itself could detect and warn about it. The warning message could be "rbind detected in last line of for() loop, this may be very slow. See XYZ reference for advice.". One if() statement in the parser would be needed, maybe? — Matt Dowle, May 04 '12 at 16:29
@MatthewDowle If that is implemented I would _insist_ that it be accompanied by a little pop-up of Clippy saying "It looks like you're trying to grow an object in a for loop..." :) — joran, May 04 '12 at 16:34
Maybe Revolution would consider this as a further distinguishing feature? — Dirk Eddelbuettel, May 04 '12 at 16:38

score 6 · Accepted Answer · answered May 04 '12 at 16:15

First off, R is not necessarily slow. As with any language, its speed depends largely on using it correctly. There are a few things that can speed up your code without altering much: preallocate your results data.frame before you begin; use a list and matrix or vector construct instead of a data.frame; switch to data.table; the list goes on, but The R Inferno is an excellent place to start.
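
As a minimal sketch of the preallocation advice, the loop below fills a vector that is sized up front and builds the data frame once at the end, rather than calling rbind on every pass. The toy data and the body of `slowStuff` are assumptions made up for illustration; only the GROUP/X/Y layout comes from the question.

```r
# Toy data standing in for the question's data frame (values are made up).
set.seed(1)
df <- data.frame(GROUP = sample(c("A", "B", "C"), 300, replace = TRUE),
                 X = runif(300), Y = runif(300))

# Hypothetical stand-in for the question's slowStuff().
slowStuff <- function(x, y) cor(x, y)

groups <- sort(unique(df$GROUP))

# Preallocate one numeric slot per group instead of growing a data frame.
res <- numeric(length(groups))
for (i in seq_along(groups)) {
  rows <- df$GROUP == groups[i]
  res[i] <- slowStuff(df$X[rows], df$Y[rows])
}

# Build the result data frame once, after the loop.
results <- data.frame(GROUP = groups, RES = res)
```

The loop itself is unchanged in spirit; what disappears is the repeated copying that rbind performs each time the result grows.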

Also, take a look here. It provides a good summary on how to take advantage of multi-core machines.

The "clean R pattern" was succinctly solved by Hadley Wickham with his plyr package and specifically ddply:

library(plyr)
library(doMC)
registerDoMC()
ddply(df, .(GROUP), your.function, .parallel=TRUE)

However, it is not necessarily fast. You can use something like:

library(parallel)
mclapply(unique(df$GROUP), function(x, df)  ...)
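
The call above leaves the function body open. One way it might be filled in, as a sketch only: split the data frame by group first, then hand each piece to mclapply. The toy data, the `slowStuff` body, and the choice of two cores are assumptions for illustration.

```r
library(parallel)

# Toy data in the question's GROUP/X/Y layout (values are made up).
set.seed(1)
df <- data.frame(GROUP = sample(LETTERS[1:4], 400, replace = TRUE),
                 X = runif(400), Y = runif(400))

# Hypothetical stand-in for the question's slowStuff().
slowStuff <- function(x, y) cor(x, y)

# mclapply relies on forking, which is unavailable on Windows;
# there mc.cores must be 1 and the call runs serially.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

pieces <- split(df, df$GROUP)   # one data frame per group
res <- mclapply(pieces, function(d) slowStuff(d$X, d$Y), mc.cores = n_cores)

# Collect the per-group values into a single data frame.
results <- data.frame(GROUP = names(res), RES = unlist(res, use.names = FALSE))
```

Splitting once up front avoids re-scanning the whole data frame for every group, which matters more as the number of groups grows.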

Or finally, you can use the foreach package:

foreach(g = unique(df$GROUP), ...) %dopar% {
   your.analysis
}
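
A fuller sketch of the foreach variant follows. It assumes the foreach and doParallel packages are installed from CRAN (doParallel is just one possible backend; doMC, used above, is another), and the toy data, the `slowStuff` body, and the core count are made up for illustration.

```r
library(foreach)
library(doParallel)

# Register a parallel backend; 2 workers is an arbitrary choice.
registerDoParallel(cores = 2)

# Toy data in the question's GROUP/X/Y layout (values are made up).
set.seed(1)
df <- data.frame(GROUP = sample(LETTERS[1:4], 400, replace = TRUE),
                 X = runif(400), Y = runif(400))

# Hypothetical stand-in for the question's slowStuff().
slowStuff <- function(x, y) cor(x, y)

# Each iteration returns a one-row data frame; .combine = rbind stacks them.
results <- foreach(g = sort(unique(df$GROUP)), .combine = rbind) %dopar% {
  gdf <- df[df$GROUP == g, ]
  data.frame(GROUP = g, RES = slowStuff(gdf$X, gdf$Y))
}

# Release any implicitly created worker cluster.
stopImplicitCluster()
```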

The R Inferno looks great, thank you very much for the hint! It would be nice to write up an R tutorial for Java/C programmers, showing the key differences... — Mulone, May 06 '12 at 11:12

score 5 · Answer 2 · answered May 04 '12 at 16:35

To back up my comment: 10 million rows, 26 groups. Done in < 3 seconds on a single-core 3.3 GHz CPU. Using only base R. No parallelization needed.

> set.seed(21)
> x <- data.frame(GROUP=sample(LETTERS,1e7,TRUE),X=runif(1e7),Y=runif(1e7))
> system.time( y <- do.call(rbind, lapply(split(x,x$GROUP),
+     function(d) data.frame(GROUP=d$GROUP[1],cor=cor(d$X,d$Y)))) )
   user  system elapsed 
   2.37    0.56    2.94 
> y
  GROUP           cor
A     A  2.311493e-03
B     B -1.020239e-03
C     C -1.735044e-03
D     D  1.355110e-03
E     E -8.027199e-04
F     F  8.234086e-04
G     G  2.337217e-04
H     H -5.861781e-04
I     I  7.799191e-04
J     J  1.063772e-04
K     K  7.174137e-04
L     L  4.151059e-04
M     M  4.440694e-04
N     N  2.568411e-03
O     O -3.827366e-04
P     P -1.239380e-03
Q     Q -1.057020e-03
R     R  1.079676e-03
S     S -1.819232e-03
T     T -3.577533e-04
U     U -1.084114e-03
V     V  6.686503e-05
W     W -1.631912e-03
X     X  8.668508e-04
Y     Y -6.460281e-04
Z     Z  1.614978e-03

By the way, parallelization will only help if your slowStuff function is the bottleneck. Your use of rbind in a loop is likely the bottleneck, unless you do something similar in slowStuff.
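
One rough way to check where the time goes is to time the split, the per-group computation, and the final assembly separately with system.time. This is a sketch on a smaller made-up data set, with a hypothetical `slowStuff` standing in for the real one; actual timings will depend on the machine and on what `slowStuff` does.

```r
# Smaller made-up data set in the same layout as above.
set.seed(21)
x <- data.frame(GROUP = sample(LETTERS, 1e5, TRUE),
                X = runif(1e5), Y = runif(1e5))

# Hypothetical stand-in for the question's slowStuff().
slowStuff <- function(x, y) cor(x, y)

# Time each stage on its own.
t_split   <- system.time(pieces <- split(x, x$GROUP))
t_compute <- system.time(vals <- lapply(pieces, function(d) slowStuff(d$X, d$Y)))
t_combine <- system.time(
  y <- data.frame(GROUP = names(vals), RES = unlist(vals, use.names = FALSE))
)

# Elapsed seconds per stage.
timings <- c(split   = t_split[["elapsed"]],
             compute = t_compute[["elapsed"]],
             combine = t_combine[["elapsed"]])
print(timings)
```

If the compute stage does not dominate, spreading it over more cores is unlikely to change the total much.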

I wasn't really focusing on the rbind part, which is very slow indeed, but on the slowStuff part. That's why I wanted to parallelise the computation. I wasn't going for a faster linear approach, but for a parallel one. — Mulone, May 04 '12 at 21:18
@Mulone: I would encourage you to profile your code before throwing more cores at the problem. You may be surprised to find that the faster linear approach is still faster than the parallel approach. — Joshua Ulrich, May 04 '12 at 21:23

score 2 · Answer 3 · answered May 04 '12 at 16:20

I think your slowness is in part due to writing non-R-style code in R. The following would give you correlations per group (I used the mtcars data set and divided it by cyl group) and do it pretty fast:

by(mtcars, mtcars$cyl, cor)
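
Note that the call above returns a full correlation matrix for each cyl group. For the question's layout, where only the correlation between X and Y is wanted, a sketch along the same lines could look like this; the toy data is made up for illustration.

```r
# Toy data in the question's GROUP/X/Y layout (values are made up).
set.seed(1)
df <- data.frame(GROUP = sample(c("A", "B"), 200, replace = TRUE),
                 X = runif(200), Y = runif(200))

# One Pearson correlation per group.
r_by_group <- by(df, df$GROUP, function(d) cor(d$X, d$Y))

# Flatten the "by" object into an ordinary data frame.
results <- data.frame(GROUP = dimnames(r_by_group)[[1]],
                      RES = as.vector(r_by_group))
```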

Tyler Rinker
