While this post is a bit dated, I would still like to add some useful insights to it.
The above answer is correct from a functional perspective, but will not give the best results from a performance perspective. The reason is that most OpenMP implementations do not shut down their threads when they reach a barrier or have no work to do. Instead, the threads enter a spin-wait loop and continue to consume processor cycles while waiting.
In the example:
#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {} // first loop

    #pragma omp for
    for(...) {} // second loop

    #pragma omp single
    dgemm_(...);

    #pragma omp for
    for(...) {} // third loop
}
What will happen is that even if the dgemm call creates additional threads inside MKL, the outer-level threads will still be actively waiting for the end of the single construct, and thus dgemm will run with reduced performance.
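To make this concrete, here is a minimal compilable sketch of the single-region pattern above. The matrix size, the constant initialization, and the Fortran-style dgemm_ prototype are my additions for illustration; link against MKL (or any BLAS that exports dgemm_):

#include <stdio.h>

/* Fortran-style BLAS prototype, as exported by MKL and most BLAS libraries. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

#define N 1024

static double A[N * N], B[N * N], C[N * N];

int main(void)
{
    const int n = N;
    const double one = 1.0, zero = 0.0;

    #pragma omp parallel
    {
        /* first loop: initialize A; nowait skips this loop's barrier */
        #pragma omp for nowait
        for (int i = 0; i < N * N; i++) A[i] = 1.0;

        /* second loop: initialize B; its implicit barrier ensures
           both A and B are complete before the single construct */
        #pragma omp for
        for (int i = 0; i < N * N; i++) B[i] = 2.0;

        /* one thread calls dgemm; the other outer-level threads wait at
           the barrier of the single construct, typically spin-waiting */
        #pragma omp single
        dgemm_("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);

        /* third loop: post-process C */
        #pragma omp for
        for (int i = 0; i < N * N; i++) C[i] *= 0.5;
    }

    printf("C[0] = %f\n", C[0]); /* expected: 1024.0 */
    return 0;
}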
There are essentially two solutions to this problem:
1) Use the code as above and, in addition to the suggested environment variables, also disable active waiting:
$ MKL_DYNAMIC=FALSE MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_NESTED=TRUE OMP_WAIT_POLICY=passive ./exe
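If you prefer to apply these settings from inside the program, MKL and OpenMP provide service calls for most of them; only the wait policy has no runtime setter and must stay in the environment. A minimal sketch (my illustration, not from the original question):

#include <omp.h>
#include <mkl.h>

void configure_threading(void)
{
    mkl_set_dynamic(0);      /* MKL_DYNAMIC=FALSE  */
    mkl_set_num_threads(8);  /* MKL_NUM_THREADS=8  */
    omp_set_num_threads(8);  /* OMP_NUM_THREADS=8  */
    omp_set_nested(1);       /* OMP_NESTED=TRUE    */
    /* OMP_WAIT_POLICY=passive must be set in the environment
       before the OpenMP runtime initializes. */
}

Call it early in main(), before the first parallel region, so the settings take effect before the runtime creates its thread pool.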
2) Modify the code to split the parallel regions:
#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {} // first loop

    #pragma omp for nowait
    for(...) {} // second loop
}

dgemm_(...);

#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {} // third loop
}
For solution 1, the threads enter sleep mode immediately and do not consume cycles. The downside is that the threads have to wake up from this deeper sleep state, which increases latency compared to the spin-wait.
For solution 2, the threads are kept in their spin-wait loop and are very likely to be actively waiting when the dgemm call enters its parallel region. The additional joins and forks also introduce some overhead, but this may still be cheaper than the over-subscription of the initial solution with the single construct, or than the wake-up latency of solution 1.
Which solution is best will clearly depend on the amount of work being done in the dgemm operation compared to the synchronization overhead of fork/join, which is mostly dominated by the thread count and the internal implementation.
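Since the trade-off is workload-dependent, the most reliable way to decide is to time both variants on your actual problem sizes. Here is a minimal timing sketch using omp_get_wtime(); the parallel reduction is just a toy stand-in, to be replaced with either variant above:

#include <stdio.h>
#include <omp.h>

/* Toy stand-in for the real work; replace the body with either variant. */
static double run_variant(void)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100000000; i++)
        sum += i * 1e-9;
    return sum;
}

int main(void)
{
    run_variant(); /* warm-up, so thread-pool creation is not timed */

    double t0 = omp_get_wtime();
    double checksum = run_variant();
    double t1 = omp_get_wtime();

    printf("checksum %.3f, elapsed %.3f s\n", checksum, t1 - t0);
    return 0;
}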