Why is MKL in parallel not faster than serial in R 3.6?

Question

I am trying to use Intel's MKL with R and adjust the number of threads using the MKL_NUM_THREADS variable.

It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.

I've been adapting Dirk Eddelbuettel's guide for CentOS, but I may have missed some flag or config somewhere.

Here is a simplified version of how I am testing how the number of threads relates to job time. I do get the expected results when using OpenBLAS.

require(callr)
#> Loading required package: callr
f <- function(i)  r(function() crossprod(matrix(1:1e9, ncol=1000))[1], 
      env=c(rcmd_safe_env(),
            R_LD_LIBRARY_PATH=MKL_R_LD_LIBRARY_PATH, 
            MKL_NUM_THREADS=as.character(i), 
            OMP_NUM_THREADS="1")
)

system.time(f(1))
#>    user  system elapsed 
#>  14.675   2.945  17.789
system.time(f(4))
#>    user  system elapsed 
#>  54.528   2.920  19.598
system.time(f(8))
#>    user  system elapsed 
#> 115.628   3.181  20.364
system.time(f(32)) 
#>    user  system elapsed 
#> 787.188   7.249  36.388

Created on 2020-05-13 by the reprex package (v0.3.0)
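The same sweep can also be driven from the shell, which may help separate MKL behaviour from anything specific to the callr setup. This is only a minimal sketch: `sweep_mkl_threads` and `bench.R` are made-up names, and the benchmark command is whatever the caller supplies.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per thread count and print the
# wall-clock time. date +%s only has whole-second resolution.
sweep_mkl_threads() {
    counts=$1
    shift
    for n in $counts; do
        start=$(date +%s)
        # Same variables as in the question: vary MKL threads, pin OpenMP to 1.
        MKL_NUM_THREADS=$n OMP_NUM_THREADS=1 "$@" > /dev/null
        end=$(date +%s)
        echo "threads=$n elapsed=$((end - start))s"
    done
}

# Usage (bench.R is a hypothetical script containing the crossprod call):
# sweep_mkl_threads "1 4 8 32" Rscript bench.R
```

If elapsed time stays flat as the thread count grows, the extra threads are not contributing useful work.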

EDIT 5/18

Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout, which shows it properly calling the BLAS routine DSYRK:

MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1

For f(8), it shows NThr:8:

MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8

I still am not getting any expected performance increase from extra cores.
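To compare runs like these side by side, the call lines can be reduced to routine, time, and thread count. A rough sketch, assuming the log format shown above (a call line starts with `MKL_VERBOSE ROUTINE(`, contains a time field such as `10.64s`, and carries an `NThr:` field); `summarize_mkl_verbose` is a made-up name, and the other time units are an assumption:

```shell
#!/bin/sh
# Hypothetical helper: read MKL_VERBOSE output on stdin and print
# "ROUTINE TIME NTHREADS" for each call line. The version banner line is
# skipped because "Intel(R)" does not match the upper-case routine pattern.
summarize_mkl_verbose() {
    awk '
    /^MKL_VERBOSE [A-Z0-9_]+\(/ {
        routine = $2
        sub(/\(.*/, "", routine)          # keep only the routine name
        elapsed = ""; nthr = ""
        for (i = 3; i <= NF; i++) {
            # Assumed time units: s, ms, us, ns.
            if ($i ~ /^[0-9.]+(s|ms|us|ns)$/) elapsed = $i
            if ($i ~ /^NThr:/) { nthr = $i; sub(/^NThr:/, "", nthr) }
        }
        print routine, elapsed, nthr
    }'
}

# Usage (mkl.log is a hypothetical file holding captured stdout):
# summarize_mkl_verbose < mkl.log
```

For the two runs above, this puts the 10.64s single-thread call next to the 11.98s eight-thread call, which makes the missing speedup easy to see.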

EDIT 2

I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that MS is using a GNU threading library; could the problem be in the threading library and not in BLAS/LAPACK itself?

Not all tasks can be parallelized. In my own test with Microsoft R and parallelized BLAS, a specific simulation was slower than single-threaded, with higher CPU load. I remember a keynote from the useR conference in Brussels about parallelization that discussed such difficulties. The video should be somewhere on Microsoft's Channel 9. – tpetzoldt, May 13 '20 at 21:15
Surely matrix multiplication (crossprod is just X'X) is supported by MKL, though? – Neal Fultz, May 13 '20 at 21:22
Have you tried checking whether the results are similar if you follow the guide to the end, e.g. using the same svd methods? If you are unable to replicate the results using the same methods, then there is likely a problem in the way you've set up the system. – Oliver, May 17 '20 at 16:47

Dirk Eddelbuettel · Accepted Answer · 2020-05-20T21:14:45.733

4

Only seeing this now: Did you check the obvious one, i.e. whether R on CentOS actually picks up the MKL?

As I recall, R on CentOS is built in a more, ahem, "restricted" mode with the shipped-with-R reference BLAS. And if and when that is the case, you simply cannot switch and choose another one as we have done within Debian and Ubuntu for 20+ years, as that requires a different initial choice when R is compiled.

Edit: Per subsequent discussions (see comments below) we all re-realized that it is important to have the threading libraries / models aligned. The MKL is an Intel product and defaults to using Intel's threading library. On Linux, the GNU compiler is closer to the system and has its own threading library, and that is the one that needs to be selected. In my writeup / script for the MKL on .deb systems I use

echo "MKL_THREADING_LAYER=GNU" >> /etc/environment

to set this "system-wide" on the machine; one can also add it just to the R environment files.
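For the per-user route, the same variable can go into an R environment file instead of /etc/environment. A minimal sketch, assuming the per-user file `~/.Renviron` (which R reads at startup) is the one you want to change; the helper name `set_env_line` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: make sure FILE ends up with exactly one KEY=VALUE line,
# so running it repeatedly does not pile up duplicates.
set_env_line() {
    file=$1
    key=$2
    value=$3
    touch "$file"
    # Drop any existing assignment of KEY (grep exits 1 if nothing is left).
    grep -v "^${key}=" "$file" > "${file}.tmp" || true
    mv "${file}.tmp" "$file"
    printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Usage (assumed target file; adjust to your setup):
# set_env_line "$HOME/.Renviron" MKL_THREADING_LAYER GNU
```

Unlike a plain `echo ... >>`, this replaces an earlier setting rather than appending a second one. R needs to be restarted for the change to take effect.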

edited May 20 '20 at 21:14

answered May 16 '20 at 04:20


I have it working with OpenBLAS and ATLAS using the same `R_LD_LIBRARY_PATH` pattern. I'm not sure if the sysadmin installed R from an official package or built it from source. My best guess right now is that this is an MKL-specific config issue. Since I do see it forking 32 times, I suspect there's some less-than-well-documented configuration I need to set. – Neal Fultz May 17 '20 at 16:33
I can confirm that we are using R from EPEL, and also that the CentOS OpenBLAS package is very conservative for whatever reason. – Neal Fultz May 18 '20 at 21:14
Just noticed your updated comment and YES there can be threading library problems. That stuff is hard to find good documentation on, but IIRC there is something hidden away in the issues or PRs for my MKL script repo. – Dirk Eddelbuettel May 19 '20 at 16:06
Setting the threading library to GNU seems to fix this - if you edit your answer I'll accept it. – Neal Fultz May 20 '20 at 20:56
Awesome news, glad you stuck that out! – Dirk Eddelbuettel May 20 '20 at 21:11

score 0 · Answer 2 · answered May 16 '20 at 03:54

I am not sure exactly how R calls MKL, but if the crossprod function calls MKL's gemm underneath, then we should see very good scalability with such inputs. What are the input problem sizes? MKL supports a verbose mode, which can show a lot of useful runtime information while dgemm is running. Could you try exporting the MKL_VERBOSE=1 environment variable and look at the log? Though I am not sure whether R will suppress the output.
