As we all know, R isn't the most efficient platform for running large analyses. If I had a large data frame containing three columns:
GROUP X Y
A 1 2
A 2 2
A 2 3
...
B 1 1
B 2 3
B 1 4
...
millions of rows
and I wanted to run a computation on each group (e.g. compute Pearson's r on X and Y) and store the results in a new data frame, I could do it like this:
df <- loadDataFrameFrom(someFile)
results <- data.frame()
for (g in unique(df$GROUP)) {
    # rows belonging to the current group
    gdf <- subset(df, GROUP == g)
    partialRes <- slowStuff(gdf$X, gdf$Y)
    # append one result row per group
    results <- rbind(results, data.frame(GROUP = g, RES = partialRes))
}
# results contains all the results here.
useResults(results)
The obvious problem is that this is VERY slow, even on a powerful multi-core machine.
My question is: is it possible to parallelise this computation, for example with a separate thread for each group or block of groups? Is there a clean R pattern to solve this simple divide-and-conquer problem?
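To make the shape of the problem concrete, here is a minimal sketch of the kind of split/apply/combine pattern I have in mind. It assumes the `parallel` package that ships with R, uses `cor()` as a stand-in for `slowStuff`, and runs on made-up toy data; `groupCor`, `pieces` and `nCores` are names invented for the illustration. Note that `mclapply` uses forked worker processes rather than threads, and forking is not available on Windows, so this may well not be the best approach.

```r
library(parallel)  # bundled with R; provides mclapply()

# Toy data standing in for the real file (hypothetical values).
set.seed(42)
df <- data.frame(
  GROUP = rep(c("A", "B", "C", "D"), each = 250),
  X = rnorm(1000),
  Y = rnorm(1000)
)

# Stand-in for slowStuff(): Pearson's r on one group's rows.
groupCor <- function(gdf) cor(gdf$X, gdf$Y, method = "pearson")

# Split once into a named list of per-group data frames.
pieces <- split(df, df$GROUP)

# mclapply() forks worker processes; on Windows mc.cores must be 1,
# in which case the groups are simply processed serially.
nCores <- if (.Platform$OS.type == "windows") 1L else 2L
partial <- mclapply(pieces, groupCor, mc.cores = nCores)

# Assemble the results once, instead of calling rbind() inside a loop.
results <- data.frame(
  GROUP = names(partial),
  RES = unlist(partial, use.names = FALSE)
)
```

Here each worker gets whole groups, and the result data frame is built in a single step at the end rather than grown row by row.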
Thanks, Mulone