
I am following the example in the iml vignette about running the calculations in parallel. However, I'm running into a couple of issues that I don't understand.

Firstly, the example loads both the future library and the future.callr library, and then creates a PSOCK cluster with 2 cores, like so:

library("future")
library("future.callr")
# Creates a PSOCK cluster with 2 cores
plan("callr", workers = 2)

However, this doesn't work at all for me: if I use plan("callr", workers = 2), then any calculation I try to run just hangs forever until I terminate the process.
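For reference, even a trivial future (nothing iml-specific, just a sanity check on the backend itself) should come back almost immediately when the backend is healthy:

library("future")
library("future.callr")

plan("callr", workers = 2)

# A trivial future: ask a worker for its process ID
f <- future(Sys.getpid())
value(f)   # should return almost instantly if the backend is working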

Instead I'm using plan(cluster), which at least completes the calculation. However, when I follow the vignette's example to calculate the interaction strength, the process (CPU) time is indeed much quicker, but the wall-clock time is now considerably slower. The code below illustrates this:

library("iml")
library("randomForest")
library("future") # used for parallel computing
library("bench") # used to measure system time

# Get data
data("Boston", package = "MASS")
X <- Boston[which(names(Boston) != "medv")]

# create randomForest model
rf <- randomForest(medv ~ ., data = Boston)


# iml predictor
predictor <- Predictor$new(rf, data = X, y = Boston$medv)

# run interaction calc sequentially
system_time({
  plan(sequential)
  Interaction$new(predictor)
})
# process = 15.9s  real = 11.2s

# run interaction calc in parallel
system_time({
  plan(cluster, workers = 2)
  Interaction$new(predictor)
})
# process = 760ms  real = 15.1s

So, as can be seen above, the process time is much quicker, but the real (wall-clock) time is notably slower, which seems to defeat the purpose of parallel computing. The issue also becomes more pronounced as the number of variables/observations increases: with a dataset of 10 variables and 300 observations, the real time is ~30s without parallelisation and ~50s with it.

My question is: what is going on here? Am I missing some fundamental idea about parallel computing, or am I implementing it incorrectly? Why would the wall-clock (real) time be so much slower when computing in parallel?

[Bonus question] What is the difference between cores and workers? The future package has two functions, availableCores and availableWorkers, but I'm not sure how they differ.


1 Answer


Going parallel is not a panacea. If it takes more time to pass data to and from the workers than you save by crunching the data in parallel, then wall-clock time will be greater.

As for the bonus question: cores represents how many physical CPU cores exist or are available for assignment, while workers is how many processes you want to distribute among the available or assigned cores.
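As a rough illustration (nothing iml-specific), you can inspect both of those functions and see the fixed round-trip cost that every parallel job pays:

library("future")

availableCores()    # how many cores this R session is allowed to use
availableWorkers()  # the worker processes/hosts a plan could start
                    # (on one machine, typically just "localhost" repeated)

# Even a trivial job pays for launching the workers and shipping
# code/data there and back:
system.time({
  plan(cluster, workers = 2)
  value(future(1 + 1))
})
plan(sequential)    # switch back to sequential processing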

You haven't told us which processor your Mac has or how many physical cores it offers, so it's hard to comment on the optimal number of workers to create.
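If you're not sure what the machine has, the parallel package (shipped with R) can report it; the numbers will of course depend on your hardware:

# Physical cores vs. logical (hyper-threaded) cores on this machine
parallel::detectCores(logical = FALSE)
parallel::detectCores(logical = TRUE)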

Beyond that, I might recommend looking at the bigparallelr and parallel packages, both to learn more about their usage and to see if they better fit your needs.
