Serial version slower than parallel with MKL_DYNAMIC=TRUE

Question

I have implemented Conjugate Gradient in Fortran by replacing the linear algebra subroutines in the Wikipedia example with Intel MKL subroutines (DGEMV, DAXPY and DNRM2 only; it turns out that a=b is faster than DCOPY and a=2*a is faster than DSCAL).

The answers are correct and there is no problem with the implementation. However, when I compile it with `ifort CG.f90 -mkl`, the results are:

MKL_SET_DYNAMIC = TRUE ; 140 seconds

MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=1 ; 70 seconds.

MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=2 ; ~100 seconds.

A few points:

I have 2 real cores and 2 virtual cores through hyperthreading. I am not trying to run 16 threads on a 2 core machine.
Profiling yielded abstruse references to M16_LAY_GAS16, which after a lot of searching came down to multpd ASM. Nothing else useful came out of it (or maybe I didn't know where to look). FWIW, I used VTune.
The problem size is not small. The above examples are for matrix sizes proportional to the size of my RAM (roughly 13k x 13k for my 4 GB system).
KMP_AFFINITY maps one thread to one processor in the serial case and 2 threads to 2 processors in the parallel case.

My question is: why isn't MKL_DYNAMIC setting the number of threads to 1 if that is optimal? I don't need to use 2 threads if 1 thread does the same work in less time.

Am I doing something wrong or is something wrong with Intel MKL?

I am curious why you haven't mentioned `DGEMV` or `DSYMV`.... — talonmies, Apr 16 '12 at 05:58

talonmies · Accepted Answer · 2012-04-16T08:28:06.547

MKL_DYNAMIC is functionally the same as OMP_DYNAMIC/omp_set_dynamic() from the OpenMP standard.

It doesn't mean "magically change the number of threads to run the code as fast as possible". It means that the runtime can, under some circumstances, change the number of threads from the user-specified value or the system default, if there are system resource or other implementation-specific reasons to do so. Given that you haven't specified a number of threads and there are 4 concurrent hardware threads available, I would guess that your MKL_SET_DYNAMIC = TRUE case is using four threads.

If you ran something like MKL_SET_DYNAMIC=TRUE MKL_SET_NUM_THREADS=16 you might find that the runtime throttles the thread count down to 4 and the performance would be better than MKL_SET_DYNAMIC=FALSE MKL_SET_NUM_THREADS=16, because the runtime might detect you are asking for more than the number of available concurrent hardware threads. But that is all I would expect it to do.
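Since the runtime will not search for the fastest setting on its own, one practical option is to measure it. The following is a minimal sketch in Python, not part of the original answer, that runs an already compiled program once per candidate thread count with `MKL_DYNAMIC=FALSE` and a fixed `MKL_NUM_THREADS`, then reports the fastest. The helper names `time_run` and `fastest_thread_count` and the binary `./a.out` are hypothetical. The sketch also assumes the program does not itself call `mkl_set_num_threads` or `mkl_set_dynamic`, since those calls take precedence over the environment variables.

```python
import os
import subprocess
import time


def time_run(cmd, num_threads):
    """Run cmd once with a fixed MKL thread count; return wall-clock seconds."""
    env = dict(os.environ)
    # Assumption: the program reads its threading settings from the environment.
    env["MKL_DYNAMIC"] = "FALSE"  # keep the runtime from adjusting the count
    env["MKL_NUM_THREADS"] = str(num_threads)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)  # raises if the program fails
    return time.perf_counter() - start


def fastest_thread_count(cmd, candidates):
    """Time cmd for each candidate thread count; return (best, timings)."""
    timings = {n: time_run(cmd, n) for n in candidates}
    best = min(timings, key=timings.get)
    return best, timings
```

For example, `fastest_thread_count(["./a.out"], [1, 2, 4])` would time the solver with 1, 2 and 4 threads. A single run per setting is noisy, so repeating each measurement is advisable when the runs are short.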

edited Apr 16 '12 at 08:28

answered Apr 16 '12 at 05:18

Running with `MKL_NUM_THREADS=16` and `MKL_DYNAMIC=TRUE` does throttle it to 2 (it turns out 2 is optimal for the problem), while FALSE goes for 4 threads. Isn't there a mechanism to "magically change the number of threads to the optimal value"? – Apr 16 '12 at 13:54
@Nunoxic: unfortunately neither the compiler nor the OpenMP runtime is psychic; they can't know *a priori* what settings to use to run code optimally. – talonmies Apr 16 '12 at 14:54
That's interesting. I'll explore how MKL_DYNAMIC decides how much to allocate. Thanks a lot! – Apr 16 '12 at 16:30
