
I am learning about parallel computing in R, and I found the following happening in my experiments.

Briefly, in the example below, why are most values of 'user' in t smaller than those in mc_t? My machine has 32 GB of memory and 2 CPUs with 4 cores and 8 hyperthreads in total.

system.time({t = lapply(1:4, function(i) {
    m = matrix(1:10^6, ncol = 100)
    t = system.time({
        m %*% t(m)
    })
    return(t)
})})


library(multicore)
system.time({
    mc_t = mclapply(1:4, function(i) {
        m = matrix(1:10^6, ncol = 100)
        t = system.time({
            m %*% t(m)
        })
        return(t)
    }, mc.cores = 4)
})

> t
[[1]]
user  system elapsed
11.136   0.548  11.703

[[2]]
user  system elapsed
11.533   0.548  12.098

[[3]]
user  system elapsed
11.665   0.432  12.115

[[4]]
user  system elapsed
11.580   0.512  12.115

> mc_t
[[1]]
user  system elapsed
16.677   0.496  17.199

[[2]]
user  system elapsed
16.741   0.428  17.198

[[3]]
user  system elapsed
16.653   0.520  17.198

[[4]]
user  system elapsed
11.056   0.444  11.520

And sessionInfo():

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
    [1] multicore_0.1-7

To clarify: sorry if my description was ambiguous. I understand that the parallel version is still quicker for the whole job. However, the timer is placed inside the function, around the calculation only, so the set-up overhead for each child process in mclapply is not counted. I am still confused about why this pure-calculation step (i.e., m %*% t(m)) is slower.

TomHall
  • My wild-ass guess is that the set-up overhead for each child process is the difference. This isn't really how one uses multicore: try comparing a single core doing `matrix(4*10^6,4000,1000)` with an `mclapply` which makes four 1000x1000 matrices and combines the returned objects. – Carl Witthoft Feb 12 '14 at 14:46
  • 1
    @CarlWitthoft is correct. You are simply measuring the overhead to communicate between the cores. My interpretation of the results is that with a single core your code takes ~12 seconds to run. So running it 4 times will take ~48 seconds. With multicore the entire process takes 16 seconds for 4 results. That extra 4 seconds is the penalty you incur in communication between cores. – Andrie Feb 12 '14 at 15:08
  • Thank you @CarlWitthoft and @Andrie, and it was my bad not to describe it clearly. I understand that the parallel version is still quicker for the whole job. However, the timer is just in the function, around the calculation, so the set-up overhead for each child process in `mclapply` is not taken into consideration. So I am still confused about why this pure calculation step is slower. – TomHall Feb 12 '14 at 15:38

2 Answers


I would guess that the timing difference is due to resource contention between the cores, possibly for memory or cache, particularly if your CPU has a cache that is shared between cores. Even if there is plenty of main memory, there can be contention in accessing it, so performance does not scale linearly with the number of cores.

Note that the %*% operator will make use of multiple cores if your R installation uses a multi-threaded math library such as MKL or ATLAS. By using multiple processes on top of that, you could have many more threads than cores, hurting your performance.
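If the BLAS is the culprit, one way to test it is to pin the BLAS to a single thread before forking, so the four `mclapply` workers use one core each instead of oversubscribing. This is only a sketch: it assumes the RhpcBLASctl package is installed (it is not part of the question), and with a sequential reference BLAS the call is a no-op.

```r
# Sketch: restrict the BLAS to one thread per process before forking,
# so mclapply's 4 workers don't oversubscribe the cores.
# Assumes the RhpcBLASctl package is available.
library(RhpcBLASctl)
library(multicore)

blas_set_num_threads(1)  # one BLAS thread per forked worker

mc_t <- mclapply(1:4, function(i) {
    m <- matrix(1:10^6, ncol = 100)
    system.time(m %*% t(m))  # time only the matrix product, as in the question
}, mc.cores = 4)
```

If the per-call 'user' times now match the serial run, the slowdown was thread oversubscription; if they stay high, memory-bandwidth or cache contention is the more likely explanation.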

Steve Weston
  • I manually ran several Rscript commands and compared their running times. Since they were automatically running on different cores, I think you are right. But could you please explain your comment on the BLAS library? I cannot see the relationship between BLAS and the performance difference. Thanks! – TomHall Feb 13 '14 at 03:50

The theoretical best possible speed-up of a parallel algorithm is calculated as

S(n) = T(1) / T(n) = 1 / (P + (1 - P) / n)

where T(n) is the time taken to execute the task using n parallel processes and P is the proportion of the whole task that is strictly serial. With this formula, let's compute the theoretical best possible speed-up with 4 processors for a task that takes 10 seconds on one processor, assuming for illustration that half of it (P = 0.5) is strictly serial:

1 / (0.5 + (1 - 0.5)/4) = 1.6x

Using this, we can see why the work per processor runs slower than in the one-processor version: the speed-up is less than 4x. This is simply how it works; the relationship is known as Amdahl's Law. Amdahl's Law gives an estimate of the maximum possible speed-up and does not even account for overheads. You can read further about it here.

To compute the time instead of the speed-up, we use the formula

T(n) = T(1) * (P + (1 - P)/n)

Let's compute the best possible runtime with 4 processors for a task that takes 10 seconds on a single processor, again with P = 0.5:

T(4) = 10 * (0.5 + (1 - 0.5)/4) = 6.25 seconds
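The two formulas above are easy to check in a couple of lines of R (P = 0.5 and T(1) = 10 s are just the illustrative values used above):

```r
# Amdahl's Law: best-case speed-up and runtime for n parallel processes,
# given serial fraction P and single-process time t1.
amdahl_speedup <- function(n, P) 1 / (P + (1 - P) / n)
amdahl_time    <- function(n, P, t1) t1 * (P + (1 - P) / n)

amdahl_speedup(4, 0.5)    # 1.6
amdahl_time(4, 0.5, 10)   # 6.25
```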
boyaronur