I have implemented Conjugate Gradient in FORTRAN by replacing the Linear Algebra subroutines in the wikipedia example by (Fortran) Intel MKL subroutines. (DGEMV, DAXPY and DNRM only. It turns out that a=b is faster than DCOPY and a=2*a is faster than DSCAL)
The answers are correct and there is no problem with the implementation. However, when I compile it as ifort CG.f90 -mkl
The results are :
MKL_SET_DYNAMIC = TRUE ; 140 seconds
MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=1 ; 70 seconds.
MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=2 ; ~100 seconds.
A few points:
- I have 2 real cores and 2 virtual cores through hyperthreading. I am not trying to run 16 threads on a 2 core machine.
- Profiling has yielded abstruse references to a
M16_LAY_GAS16
which after a lot of searching came down tomultpd
ASM. Nothing useful came out otherwise (or maybe, I didn't know where to look) FWIW, I used VTune. - The problem size is not small. The above examples are for matrix sizes proportional to the size of my RAM (Roughly 13k x 13k for my 4 GB System).
KMP_AFFINITY
maps one thread to one processor in serial case and 2 threads to 2 processors in parallel.
My question is : Why isn't MKL_DYNAMIC setting number of threads as 1 if that is optimal? I don't necessarily need to use 2 threads if the same work (in lesser time) is done by 1.
Am I doing something wrong or is something wrong with Intel MKL?