R parallel execution

Question

I have a data frame with 5 columns:

COL1 | COL2 | COL3 | COL4 | COL5

I need to aggregate on COL1 and apply 4 different functions to columns COL2 through COL5:

    a1<-aggregate( COL2 ~ COL1, data = dataframe, sum)
    a2<-aggregate( COL3 ~ COL1, data = dataframe, length)
    a3<-aggregate( COL4 ~ COL1, data = dataframe, max)
    a4<-aggregate( COL5 ~ COL1, data = dataframe, min)

    finalDF <- Reduce(function(x, y) merge(x, y, all = TRUE), list(a1, a2, a3, a4))

1) I have 24 cores on the machine. How can I execute the 4 lines of code above (a1, a2, a3, a4) in parallel? I want to use 4 cores simultaneously and then use Reduce to compute finalDF.

2) Can I apply a different function to each column in a single aggregate call? I can apply one function to multiple columns, and I can apply multiple functions to one column, but I was unable to apply different functions to different columns (COL2: sum, COL3: length, COL4: max, COL5: min).

If `aggregate` is too slow, you should probably use data.table or dplyr instead. Speed gains can be expected to be much better than a factor of 4. — Roland, May 22 '14 at 15:17
Please provide a reproducible example, and explain the dimensions of your data, including the total number of unique groups in your real data set. Even better if you can provide code to generate sample data representative of your real dataset. — Arun, May 22 '14 at 21:35

score 3 · Answer 1 · answered May 22 '14 at 21:12

This is an example of how you might do it with dplyr, as suggested by @Roland:

    set.seed(2)
    df <- data.frame(COL1 = sample(LETTERS, 1e6, replace = TRUE),
                     COL2 = rnorm(1e6),
                     COL3 = runif(1e6, 100, 1000),
                     COL4 = rnorm(1e6, 25, 100),
                     COL5 = runif(1e6, -100, 10))

    #> head(df)
    #  COL1      COL2     COL3       COL4       COL5
    #1    E 1.0579823 586.2360  -3.157057 -14.462318
    #2    S 0.1238110 872.3868 129.579090   9.525772
    #3    O 0.4902512 498.0537  93.063487   1.910506
    #4    E 1.7215843 200.7077 126.716256  -5.865204
    #5    Y 0.6515853 275.3369  12.554218 -26.301225
    #6    Y 0.7959678 134.4977  54.789415 -33.145334

    library(dplyr)

    # Group by COL1 and compute a different summary for each column
    df <- df %>%
      group_by(COL1) %>%
      summarize(a1 = sum(COL2),
                a2 = length(COL3),
                a3 = max(COL4),
                a4 = min(COL5))      # add as many calculations as you like

On my machine this took 0.064 seconds.

    #> head(df)
    #Source: local data frame [6 x 5]
    #
    #  COL1           a1    a2       a3        a4
    #1    A   -0.9068368 38378 403.4208 -99.99943
    #2    B    6.0557452 38551 419.0970 -99.99449
    #3    C  108.5680251 38673 491.8061 -99.99382
    #4    D  -34.1217133 38469 481.0626 -99.99697
    #5    E  -68.2998926 38168 452.8280 -99.99602
    #6    F -185.9059338 38159 417.2271 -99.99995
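
For the literal first question (running the four aggregate calls on 4 cores and then combining them with Reduce), here is a minimal sketch using the base `parallel` package. It is only an illustration, not something from the original post: the `tasks` list and the `aggregate_one` helper are hypothetical names, the data is synthetic and smaller than the example above, and `makeCluster(4)` assumes the machine can start four local worker processes. The whole data frame is copied to every worker, so this may well be slower than the serial version; the `system.time` calls are there so you can check on your own data.

```r
library(parallel)

# Synthetic data in the same shape as the example above (smaller, for a quick run)
set.seed(2)
n <- 1e5
dataframe <- data.frame(COL1 = sample(LETTERS, n, replace = TRUE),
                        COL2 = rnorm(n),
                        COL3 = runif(n, 100, 1000),
                        COL4 = rnorm(n, 25, 100),
                        COL5 = runif(n, -100, 10))

# One (column, function) pair per task, matching a1..a4 in the question
tasks <- list(list(col = "COL2", fun = sum),
              list(col = "COL3", fun = length),
              list(col = "COL4", fun = max),
              list(col = "COL5", fun = min))

# Hypothetical helper: aggregate one column by COL1 with the task's function
aggregate_one <- function(task, data) {
  f <- reformulate("COL1", response = task$col)  # builds e.g. COL2 ~ COL1
  aggregate(f, data = data, FUN = task$fun)
}

# Parallel version: one task per worker process
cl <- makeCluster(4)
parallel_time <- system.time(
  results <- parLapply(cl, tasks, aggregate_one, data = dataframe)
)
stopCluster(cl)

# Serial version of the same work, for comparison
serial_time <- system.time(
  serial <- lapply(tasks, aggregate_one, data = dataframe)
)

# Combine the four per-column results on COL1, as in the question
finalDF <- Reduce(function(x, y) merge(x, y, all = TRUE), results)
```

Comparing `parallel_time` and `serial_time` shows whether the extra processes pay for themselves. With only four tasks and the cost of shipping the data to each worker, the single grouped summarize above is likely the simpler route.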
