
I run the following sample code to simulate values, and below is a snapshot of the usage of the 4 cores. It takes a while before all cores are used at full capacity, and I'd like to understand what's going on and, ultimately, how to make it faster.

library(doParallel)
library(data.table)

data <- data.table(a=runif(10000000), b=runif(10000000), quantile=runif(10000000))

e <- nrow(data) %/% 1000000 + 1   # number of chunks of ~1 million rows
dataSplit <- split(data[], seq_len(nrow(data)) %/% 1000000)
qbetaVec <- function(lossvalues) qbeta(lossvalues$quantile, lossvalues$a, lossvalues$b)

cl <- makeCluster(4)              # 4 worker processes
registerDoParallel(cl)
res2 <- foreach(i=1:e) %dopar% qbetaVec(dataSplit[[i]])
res3 <- unlist(res2)

It takes about 67 secs to complete on my machine. I had a look at the performance monitor while res2 was running, and it takes a while before all 4 cores reach full capacity. I'd like to understand the reason for this. Is it unavoidable? What is going on before all cores are fully utilized? Would it be faster to try this with RcppParallel?

[screenshot: performance monitor showing the 4 cores only gradually reaching full utilization]

charliealpha
  • Is this a toy example? Because you could make it faster without parallelization. – Roland Jul 25 '17 at 05:02
  • @Roland I have used qbeta by itself since it is vectorized, and it is faster (about 2x). I'd like to see if parallelization can improve it further, though. – charliealpha Jul 25 '17 at 05:27

1 Answer


Parallelization involves overhead, notably transfer of data to and from the workers. Also, if you only use four workers and each task takes about equally long, it doesn't make sense to split this into more than four tasks.

library(microbenchmark)

microbenchmark(
  OP = {
    e <- nrow(data)%/%1000000+1 
    dataSplit<-split(data[],seq_len(nrow(data))%/%1000000)
    qbetaVec<-function(lossvalues) qbeta(lossvalues$quantile,lossvalues$a,lossvalues$b)

    cl <- makeCluster(4)
    registerDoParallel(cl)
    res2<-foreach(i=1:e) %dopar% qbetaVec(dataSplit[[i]])
    res3<-unlist(res2)
    stopCluster(cl)
  },
  OP_4split = {
    e <- 4 
    dataSplit<-split(data[],seq_len(nrow(data)) %% e) #note this change
    qbetaVec<-function(lossvalues) qbeta(lossvalues$quantile,lossvalues$a,lossvalues$b)

    cl <- makeCluster(e)
    registerDoParallel(cl)
    res2<-foreach(i=1:e) %dopar% qbetaVec(dataSplit[[i]])
    res3<-unlist(res2)
    stopCluster(cl)
  },
  serial = {
    res3 <- data[, qbeta(quantile, a, b)]
  },
  times = 3
)

#Unit: seconds
#      expr      min       lq     mean   median       uq      max neval
#        OP 17.31950 17.35962 17.37491 17.39975 17.40262 17.40549     3
# OP_4split 15.98415 16.03414 16.10776 16.08413 16.16957 16.25500     3
#    serial 22.62642 22.64165 22.66247 22.65689 22.68050 22.70411     3

It's only slightly better with 4 chunks. However, there is really a lot of data that has to be transferred and reassembled. Splitting the data is also a costly operation. I wouldn't bother with parallelization here.
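If you still want parallelism but want to avoid sending the table to the workers, one option on Linux/macOS is forking. Below is a minimal sketch (not benchmarked here, and not applicable on Windows, where forking is unavailable): parallel::mclapply forks the current R process, so the workers see the data through copy-on-write shared memory instead of receiving a serialized copy.

library(parallel)
library(data.table)

n_cores <- 4
idx <- splitIndices(nrow(data), n_cores)   # contiguous blocks of row indices

# forked workers share the parent's memory, so 'data' is not serialized to them
res_fork <- mclapply(idx,
                     function(rows) data[rows, qbeta(quantile, a, b)],
                     mc.cores = n_cores)
res3 <- unlist(res_fork)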

Roland
  • Thank you for your input. May I ask what processor you use? It's about 4x faster than mine. I should look into improving hardware too. – charliealpha Jul 25 '17 at 06:32
  • Regarding splitting of the data, I tried this instead and it saves time, but I'm not sure: e <- nrow(data)%/%100000; data[, grp:=seq_len(.N)%/%100000]; res2 <- foreach(i=0:e, .packages = "data.table") %dopar% qbetaVec(data[grp==i]) (written out more fully below) – charliealpha Jul 25 '17 at 06:33
  • @charliealpha i7-4790K @ 4.00GHz – Roland Jul 25 '17 at 08:20
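The approach from that last comment, written out as a complete block (a sketch based on the comment, not benchmarked here): rows are tagged with a chunk id instead of physically splitting the table. Note that with a PSOCK cluster each worker still needs its own copy of the full table, so the transfer cost discussed in the answer does not go away.

library(doParallel)
library(data.table)

e <- nrow(data) %/% 100000
data[, grp := seq_len(.N) %/% 100000]   # tag each row with a chunk id

cl <- makeCluster(4)
registerDoParallel(cl)
# 'data' and 'qbetaVec' are exported to each worker, i.e. a full copy per worker
res2 <- foreach(i = 0:e, .packages = "data.table") %dopar% qbetaVec(data[grp == i])
res3 <- unlist(res2)
stopCluster(cl)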