
I run the following sample code to simulate values, and below is a snapshot of the usage of the 4 cores. It takes a while before all cores are used at full capacity, and I'd like to understand what's going on and, ultimately, how to make it faster.

library(doParallel)
library(data.table)

data <- data.table(a=runif(10000000), b=runif(10000000), quantile=runif(10000000))

e <- nrow(data) %/% 1000000 + 1   # number of chunks of ~1 million rows
dataSplit <- split(data[], seq_len(nrow(data)) %/% 1000000)
qbetaVec <- function(lossvalues) qbeta(lossvalues$quantile, lossvalues$a, lossvalues$b)

cl <- makeCluster(4)              # 4 worker processes
registerDoParallel(cl)
res2 <- foreach(i=1:e) %dopar% qbetaVec(dataSplit[[i]])
res3 <- unlist(res2)

It takes about 67 secs to complete on my machine. I had a look at the performance monitor while res2 was running, and it takes a while before all 4 cores reach full capacity. I'd like to understand the reason for this. Is it unavoidable? What is going on before all cores are fully utilized? Would it be faster to try this with RcppParallel?

[screenshot: performance monitor showing the 4 cores only gradually reaching full utilization]

charliealpha
  • Is this a toy example? Because you could make it faster without parallelization. – Roland Jul 25 '17 at 05:02
  • @Roland I have used qbeta by itself since it is vectorized, and it is faster (about 2x). I'd like to see if parallelization can improve it further, though. – charliealpha Jul 25 '17 at 05:27

1 Answer


Parallelization involves overhead, notably transfer of data to and from the workers. Also, if you only use four workers and each task takes about equally long, it doesn't make sense to split this into more than four tasks.

library(microbenchmark)

microbenchmark(
  OP = {
    e <- nrow(data)%/%1000000+1 
    dataSplit<-split(data[],seq_len(nrow(data))%/%1000000)
    qbetaVec<-function(lossvalues) qbeta(lossvalues$quantile,lossvalues$a,lossvalues$b)

    cl <- makeCluster(4)
    registerDoParallel(cl)
    res2<-foreach(i=1:e) %dopar% qbetaVec(dataSplit[[i]])
    res3<-unlist(res2)
    stopCluster(cl)
  },
  OP_4split = {
    e <- 4 
    dataSplit<-split(data[],seq_len(nrow(data)) %% e) #note this change
    qbetaVec<-function(lossvalues) qbeta(lossvalues$quantile,lossvalues$a,lossvalues$b)

    cl <- makeCluster(e)
    registerDoParallel(cl)
    res2<-foreach(i=1:e) %dopar% qbetaVec(dataSplit[[i]])
    res3<-unlist(res2)
    stopCluster(cl)
  },
  serial = {
    res3 <- data[, qbeta(quantile, a, b)]
  },
  times = 3
)

#Unit: seconds
#      expr      min       lq     mean   median       uq      max neval
#        OP 17.31950 17.35962 17.37491 17.39975 17.40262 17.40549     3
# OP_4split 15.98415 16.03414 16.10776 16.08413 16.16957 16.25500     3
#    serial 22.62642 22.64165 22.66247 22.65689 22.68050 22.70411     3

It's only slightly better with 4 chunks. However, there is really a lot of data that has to be transferred and reassembled. Splitting the data is also a costly operation. I wouldn't bother with parallelization here.
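If you still want parallelism but want to avoid sending the table to the workers, one option on Linux/macOS is forking. Below is a minimal sketch (not benchmarked here, and not applicable on Windows, where forking is unavailable): parallel::mclapply forks the current R process, so the workers see the data through copy-on-write shared memory instead of receiving a serialized copy.

library(parallel)
library(data.table)

n_cores <- 4
idx <- splitIndices(nrow(data), n_cores)   # contiguous blocks of row indices

# forked workers share the parent's memory, so 'data' is not serialized to them
res_fork <- mclapply(idx,
                     function(rows) data[rows, qbeta(quantile, a, b)],
                     mc.cores = n_cores)
res3 <- unlist(res_fork)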

Roland
  • Thank you for your input. May I ask what processor you use? It's about 4x faster than mine. I should look into improving hardware too. – charliealpha Jul 25 '17 at 06:32
  • Regarding splitting of the data, I tried this instead and it saves time, but I'm not sure: e <- nrow(data)%/%100000; data[, grp:=seq_len(.N)%/%100000]; res2 <- foreach(i=0:e, .packages = "data.table") %dopar% qbetaVec(data[grp==i]) (written out more fully below) – charliealpha Jul 25 '17 at 06:33
  • @charliealpha i7-4790K @ 4.00GHz – Roland Jul 25 '17 at 08:20
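The approach from that last comment, written out as a complete block (a sketch based on the comment, not benchmarked here): rows are tagged with a chunk id instead of physically splitting the table. Note that with a PSOCK cluster each worker still needs its own copy of the full table, so the transfer cost discussed in the answer does not go away.

library(doParallel)
library(data.table)

e <- nrow(data) %/% 100000
data[, grp := seq_len(.N) %/% 100000]   # tag each row with a chunk id

cl <- makeCluster(4)
registerDoParallel(cl)
# 'data' and 'qbetaVec' are exported to each worker, i.e. a full copy per worker
res2 <- foreach(i = 0:e, .packages = "data.table") %dopar% qbetaVec(data[grp == i])
res3 <- unlist(res2)
stopCluster(cl)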