I am trying to use Intel's MKL with R and adjust the number of threads using the MKL_NUM_THREADS variable.
It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.
I've been adapting Dirk Eddelbuettel's guide for CentOS, but I may have missed a flag or config setting somewhere.
Here is a simplified version of how I am testing how the number of threads relates to job time. I do get the expected results when using OpenBLAS.
require(callr)
#> Loading required package: callr
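# Run crossprod() in a fresh R session with MKL limited to i threads.
# MKL_R_LD_LIBRARY_PATH is defined elsewhere (not shown in this simplified version).
f <- function(i) r(function() crossprod(matrix(1:1e9, ncol=1000))[1],
                   env=c(rcmd_safe_env(),
                         R_LD_LIBRARY_PATH=MKL_R_LD_LIBRARY_PATH,
                         MKL_NUM_THREADS=as.character(i),
                         OMP_NUM_THREADS="1")
                   )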
system.time(f(1))
#>    user  system elapsed
#>  14.675   2.945  17.789
system.time(f(4))
#>    user  system elapsed
#>  54.528   2.920  19.598
system.time(f(8))
#>    user  system elapsed
#> 115.628   3.181  20.364
system.time(f(32))
#>    user  system elapsed
#> 787.188   7.249  36.388
Created on 2020-05-13 by the reprex package (v0.3.0)
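To make the pattern in these timings easier to read, here is a minimal sketch that tabulates them and derives two figures: the speedup relative to the single-threaded run, and the average number of busy cores (CPU time divided by wall-clock time). The numbers are copied from the output above; the column names `speedup` and `busy_cores` are my own labels, not anything produced by R or MKL.

```r
# Timings copied from the system.time() output above.
timings <- data.frame(
  threads = c(1, 4, 8, 32),
  user    = c(14.675, 54.528, 115.628, 787.188),
  system  = c(2.945, 2.920, 3.181, 7.249),
  elapsed = c(17.789, 19.598, 20.364, 36.388)
)

# Speedup relative to the single-threaded run (values above 1 mean faster).
timings$speedup <- timings$elapsed[1] / timings$elapsed

# Average number of busy cores: total CPU time divided by wall-clock time.
timings$busy_cores <- (timings$user + timings$system) / timings$elapsed

print(timings, digits = 3)
```

The busy-core figure rises with the thread count while the speedup stays below 1 for every multi-threaded run, so the extra threads are consuming CPU without shortening the job. Note that the elapsed time here also covers starting the subprocess and building the matrix, not only the crossprod() call.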
EDIT 5/18
Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout, which shows that the call is reaching MKL's BLAS routine DSYRK:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
For f(8), it shows NThr:8:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
I am still not getting the expected performance increase from extra cores.
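To compare runs without reading the raw log, the MKL_VERBOSE call lines above can be reduced to routine name, reported time, and thread count. The following is only a rough sketch: `parse_mkl_verbose` is a hypothetical helper name, and the regular expression assumes the line layout shown above (routine name, argument list, a time ending in `s`, `ms`, or `us`, then `NThr:`), which may differ in other MKL versions.

```r
# Hypothetical helper: extract routine, time in seconds, and thread count
# from MKL_VERBOSE call lines. Header lines and other text are skipped.
parse_mkl_verbose <- function(lines) {
  pattern <- "^MKL_VERBOSE ([A-Z0-9_]+)\\(.*\\) ([0-9.]+)(us|ms|s) .*NThr:([0-9]+)"
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) > 0]  # drop lines that did not match
  unit_to_sec <- c(us = 1e-6, ms = 1e-3, s = 1)  # assumed unit suffixes
  data.frame(
    routine = vapply(hits, function(h) h[2], character(1)),
    seconds = vapply(hits, function(h) as.numeric(h[3]) * unit_to_sec[[h[4]]], numeric(1)),
    threads = vapply(hits, function(h) as.integer(h[5]), integer(1))
  )
}

# The two DSYRK lines from the output above, plus a shortened header line.
log_lines <- c(
  "MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture",
  "MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1",
  "MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8"
)
calls <- parse_mkl_verbose(log_lines)
print(calls)
```

Read this way, the time MKL itself reports for DSYRK does not drop between one and eight threads, so the lack of speedup is visible inside the routine and is not only overhead from the surrounding R code.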
EDIT 2
I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that Microsoft is using a GNU threading library; could the problem be in the threading library and not in BLAS/LAPACK itself?