I'm running a very simple MKL BLAS matrix-matrix and matrix-vector multiplication on a computer with two AMD EPYC 7443 24-Core Processors and 1007GB RAM.

The code, compiling line and test results are given at the end of this post.

MKL is apparently not multithreading the mat-vec operation, only the mat-mat, as you can see below.

How can I make the mat-vec operation multithreaded? What am I doing wrong?

Here's the code:

program main

  use blas95

  implicit none

  integer, parameter :: lp = kind(1.0d0)   ! double precision kind
  integer :: m, n, i, nthr_prev
  complex(kind=lp), dimension(:), allocatable :: x, y
  complex(kind=lp), dimension(:,:), allocatable :: A, B, C

  ! mkl_set_num_threads_local is a function (it returns the previous
  ! local thread count), so it needs a declared result type
  integer, external :: mkl_set_num_threads_local

  m = 2**12
  n = 2**12

  allocate(A(m,n), B(m,n), C(m,n))
  allocate(x(n), y(m))

  ! initialize the operands so the timings don't run on uninitialized memory
  A = (1.0_lp, 0.0_lp)
  B = (1.0_lp, 0.0_lp)
  x = (1.0_lp, 0.0_lp)

  call mkl_set_dynamic(0)   ! keep MKL from lowering the thread count

  do i = 0, 5
     nthr_prev = mkl_set_num_threads_local(2**i)
     call gemm(A, B, C)     ! C = A*B  (ZGEMM)
  end do
  do i = 0, 5
     nthr_prev = mkl_set_num_threads_local(2**i)
     call gemv(A, x, y)     ! y = A*x  (ZGEMV)
  end do

end program main

Here's my compile line:

gfortran -Ofast -I$MKLROOT/include -I$BLASROOT/include/intel64/lp64  main.F90 -L$MKLROOT/lib/intel64 -o main -lgomp -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core $BLASROOT/lib/intel64/libmkl_blas95_lp64.a
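For completeness, the same threading controls can also be set from the environment instead of the MKL service API; a minimal sketch using MKL's standard environment variables (`./main` is the binary built by the line above):

```shell
# Run the binary with MKL's threading controlled via the environment
export MKL_VERBOSE=1        # print one timing line per BLAS call
export MKL_DYNAMIC=false    # don't let MKL reduce the thread count on its own
export MKL_NUM_THREADS=32   # upper bound on MKL's thread count
./main
```

Note that a call to `mkl_set_num_threads_local` inside the program overrides `MKL_NUM_THREADS` for the calling thread.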

Here's the output:

MKL_VERBOSE oneMKL 2022.0 Product build 20211112 for Intel(R) 64 architecture Intel(R) Architecture processors, Lnx 1.79GHz lp64 gnu_thread
MKL_VERBOSE ZGEMM(N,N,4096,4096,4096,0x7fff21099cf0,0x154a1f17b010,4096,0x154a0f17a010,4096,0x7fff21099ce0,0x1549ff179010,4096) 10.94s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:1
MKL_VERBOSE ZGEMM(N,N,4096,4096,4096,0x7fff21099cf0,0x154a1f17b010,4096,0x154a0f17a010,4096,0x7fff21099ce0,0x1549ff179010,4096) 5.90s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:2
MKL_VERBOSE ZGEMM(N,N,4096,4096,4096,0x7fff21099cf0,0x154a1f17b010,4096,0x154a0f17a010,4096,0x7fff21099ce0,0x1549ff179010,4096) 3.76s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ZGEMM(N,N,4096,4096,4096,0x7fff21099cf0,0x154a1f17b010,4096,0x154a0f17a010,4096,0x7fff21099ce0,0x1549ff179010,4096) 1.59s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:8
MKL_VERBOSE ZGEMM(N,N,4096,4096,4096,0x7fff21099cf0,0x154a1f17b010,4096,0x154a0f17a010,4096,0x7fff21099ce0,0x1549ff179010,4096) 925.07ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:16
MKL_VERBOSE ZGEMM(N,N,4096,4096,4096,0x7fff21099cf0,0x154a1f17b010,4096,0x154a0f17a010,4096,0x7fff21099ce0,0x1549ff179010,4096) 606.32ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:32
MKL_VERBOSE ZGEMV(N,4096,4096,0x7fff21099d10,0x154a1f17b010,4096,0x1d59930,1,0x7fff21099d00,0x1d69940,1) 12.23ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:1
MKL_VERBOSE ZGEMV(N,4096,4096,0x7fff21099d10,0x154a1f17b010,4096,0x1d59930,1,0x7fff21099d00,0x1d69940,1) 11.68ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:2
MKL_VERBOSE ZGEMV(N,4096,4096,0x7fff21099d10,0x154a1f17b010,4096,0x1d59930,1,0x7fff21099d00,0x1d69940,1) 11.66ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ZGEMV(N,4096,4096,0x7fff21099d10,0x154a1f17b010,4096,0x1d59930,1,0x7fff21099d00,0x1d69940,1) 11.62ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:8
MKL_VERBOSE ZGEMV(N,4096,4096,0x7fff21099d10,0x154a1f17b010,4096,0x1d59930,1,0x7fff21099d00,0x1d69940,1) 11.64ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:16
MKL_VERBOSE ZGEMV(N,4096,4096,0x7fff21099d10,0x154a1f17b010,4096,0x1d59930,1,0x7fff21099d00,0x1d69940,1) 11.60ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:32

And here's the result of the mat-vec test alone, with a larger matrix and vector:

MKL_VERBOSE oneMKL 2022.0 Product build 20211112 for Intel(R) 64 architecture Intel(R) Architecture processors, Lnx 1.79GHz lp64 gnu_thread
MKL_VERBOSE ZGEMV(N,65536,65536,0x7fff04973380,0x14f20a01e010,65536,0x1502125d9010,1,0x7fff04973370,0x14d209f1b010,1) 4.89s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:1
MKL_VERBOSE ZGEMV(N,65536,65536,0x7fff04973380,0x14f20a01e010,65536,0x1502125d9010,1,0x7fff04973370,0x14d209f1b010,1) 4.87s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:2
MKL_VERBOSE ZGEMV(N,65536,65536,0x7fff04973380,0x14f20a01e010,65536,0x1502125d9010,1,0x7fff04973370,0x14d209f1b010,1) 4.90s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ZGEMV(N,65536,65536,0x7fff04973380,0x14f20a01e010,65536,0x1502125d9010,1,0x7fff04973370,0x14d209f1b010,1) 4.90s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:8
MKL_VERBOSE ZGEMV(N,65536,65536,0x7fff04973380,0x14f20a01e010,65536,0x1502125d9010,1,0x7fff04973370,0x14d209f1b010,1) 4.90s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:16
MKL_VERBOSE ZGEMV(N,65536,65536,0x7fff04973380,0x14f20a01e010,65536,0x1502125d9010,1,0x7fff04973370,0x14d209f1b010,1) 4.90s CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:32

Edit 1: Manually parallelizing the mat-vec with OpenMP yields the results below, showing that multithreading this operation is profitable at 4 threads or more.

 A(65536,65536)Time[s]=   4.71       NThr=           1
 A(65536,65536)Time[s]=   6.46       NThr=           2
 A(65536,65536)Time[s]=   3.15       NThr=           4
 A(65536,65536)Time[s]=   1.71       NThr=           8
 A(65536,65536)Time[s]=   1.18       NThr=          16
 A(65536,65536)Time[s]=   1.21       NThr=          32
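The manual OpenMP version used for Edit 1 looks roughly like the following (a hypothetical reconstruction, not the exact code; sizes are reduced to 4096 here so it runs on modest machines, whereas Edit 1 used 65536):

```fortran
program matvec_omp
  use omp_lib
  implicit none

  integer, parameter :: lp = kind(1.0d0)
  integer :: m, n, i, j
  complex(kind=lp), allocatable :: A(:,:), x(:), y(:)
  real(kind=lp) :: t0, t1

  m = 2**12
  n = 2**12
  allocate(A(m,n), x(n), y(m))
  A = (1.0_lp, 0.0_lp)
  x = (1.0_lp, 0.0_lp)

  t0 = omp_get_wtime()
  ! each thread accumulates a disjoint block of rows of y
  !$omp parallel do private(j)
  do i = 1, m
     y(i) = (0.0_lp, 0.0_lp)
     do j = 1, n
        y(i) = y(i) + A(i,j)*x(j)
     end do
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print *, 'A(', m, ',', n, ') Time[s]=', t1 - t0, &
           ' NThr=', omp_get_max_threads()
end program matvec_omp
```

Compile with `gfortran -Ofast -fopenmp` and set `OMP_NUM_THREADS` to sweep the thread count. Note the inner `j` loop traverses a row of the column-major array `A` with stride `m`, so this sketch is not cache-optimal; it is only meant to show that the operation does scale with threads, as the timings above indicate.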

Comments:
    Is it even profitable to parallelize this operation? Is it expected to be faster? – Vladimir F Героям слава Apr 13 '23 at 06:22
  • See https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Only-10-speed-increase-for-dgemv/m-p/914815#M12473 Also related https://stackoverflow.com/questions/14325381/alg-mkl-threaded-dgemv – Vladimir F Героям слава Apr 13 '23 at 07:26
  • MKL has (probably hardware-dependent) runtime thresholds that decide whether it's worth multithreading a given routine, and on how many threads. It seems that a matrix-vector multiplication is never considered worth multithreading, which makes sense (it is a memory-bound operation on most hardware) – PierU Apr 13 '23 at 07:45
  • The AMD EPYC 7443 is an AMD processor with a NUMA architecture. I am not sure MKL is optimized for that (past experiments seem to indicate it is not). The current throughput is ~10.8 GiB/s, which is not bad but also not great for this processor. It seems MKL does use >1 thread, since the computation is faster with two threads; it may be a good idea to check that. Timings alone are rarely enough to understand how well a system performs. – Jérôme Richard Apr 13 '23 at 09:20
  • @VladimirFГероямслава Yes it is expected to be faster, see Edit 1. I can tell it's not using threads from htop. The larger point is: How do I make MKL parallelize this operation? – Astor Apr 13 '23 at 15:49
  • It is faster (well, not with 2 threads, which is strange), but with a poor scaling. Intel maybe considers the efficiency to decide which routines can be multithreaded or not, which is a conservative approach. – PierU Apr 13 '23 at 15:58
  • I cannot reproduce the problem. On my old Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz I get 4.43, 2.28, 1.16, 2.22, 0.685 ms for 1, 2, 3, 4 threads. I used Intel Fortran. – Vladimir F Героям слава Apr 13 '23 at 16:11
  • @VladimirFГероямслава Thank you! Could you try with gfortran? – Astor Apr 13 '23 at 16:14
  • Maybe later, now I have to go and I am getting a segmentation fault, maybe I have to recompile BLAS95. – Vladimir F Героям слава Apr 13 '23 at 16:26
  • The results with gfortran are much worse, 18.22, 21.79, 14.16, 15.81. I think it does use threading but it is much less efficient. Also, even the single thread time is much worse. I do not know why. – Vladimir F Героям слава Apr 14 '23 at 09:01
  • Actually, even with Intel it does not look as nice as it did yesterday. 11.4, 6.5, 4.9, 10.2. The 8-thread increase is not surprising, it is a quadcore. But I am confused in what changed. Maybe I did it yesterday with smaller vectors? I think not, but I did play with the size. – Vladimir F Героям слава Apr 14 '23 at 09:10
  • That is what surprises me, I am using vectors of size 65536, it should accelerate! But I can't see the threads on htop. Can you see them @VladimirFГероямслава ? Is it really threading? – Astor Apr 15 '23 at 17:11
  • Same query is raised in Intel Communities. For more information, please refer this thread(https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-BLAS-not-multithreading-zgemv/m-p/1476322#M34461 ) – AlekhyaV - Intel May 10 '23 at 14:39
  • @AlekhyaV-Intel Note that Intel is refusing to support this issue since it's an AMD processor. – Astor May 16 '23 at 20:22

0 Answers