KMP_PLACE_THREADS sets OMP_NUM_THREADS implicitly, so you don't need to set it separately in your MIC environment.
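For reference (as I understand the Intel runtime's syntax), the KMP_PLACE_THREADS value is a comma-separated triple: number of cores (suffix `c`), threads per core (suffix `t`), and an optional starting core offset (suffix `o`). For example:

```shell
# 10 cores, 4 threads per core, starting at core 11
# (i.e. 40 OpenMP threads pinned to cores 11-20)
export KMP_PLACE_THREADS=10c,4t,11o
```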
If you would like to use 59 tasks with 4 threads per task you have a few options.
MPI/OpenMP
As you mentioned, you could use a hybrid MPI/OpenMP approach. In this case each rank uses its own OpenMP thread domain. I have achieved this in the past by running mpirun natively on the MIC with something like this:
#!/bin/bash
export I_MPI_PIN=off
mpirun -n 1 -env KMP_PLACE_THREADS=10c,4t,1o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,11o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,21o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,31o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,41o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,51o ./scaling
This will create 6 MPI ranks, with each rank's threads explicitly placed starting at cores 1, 11, 21, 31, 41 and 51, and 40 OpenMP threads (10 cores x 4 threads) per rank.
You will have to design your MPI code to split NUM_JOBS over your ranks and use OpenMP internally inside your asynchronous_task().
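One common way to split the jobs is a contiguous block distribution per rank. Here is a minimal sketch in C of a hypothetical helper (the name `job_block` and its signature are my own, not from any MPI API); each rank would call it with the values returned by MPI_Comm_rank and MPI_Comm_size, then loop over its own block calling the task:

```c
#include <assert.h>

/* Hypothetical helper: compute this rank's contiguous block of jobs.
 * Distributes num_jobs as evenly as possible over nranks; the first
 * (num_jobs % nranks) ranks each hold one extra job. Writes the job
 * count for this rank into *count and returns the 0-based start index. */
static int job_block(int num_jobs, int nranks, int rank, int *count)
{
    int base = num_jobs / nranks;
    int rem  = num_jobs % nranks;

    *count = base + (rank < rem ? 1 : 0);
    /* ranks below 'rem' each start rank*(base+1) in; later ranks are
     * shifted by the 'rem' extra jobs already handed out */
    return rank * base + (rank < rem ? rank : rem);
}
```

With 59 jobs over 6 ranks this gives ranks 0-4 ten jobs each and rank 5 nine jobs, so no rank sits more than one job idle at the end.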
Nested OpenMP
The other possibility is to use nested OpenMP. This will almost certainly be more advantageous for total memory consumption on the Xeon Phi.
In this case, you will also need to expose parallelism inside your asynchronous_task using OpenMP directives.
At the top-level loop you can start 59 tasks and then use 4 threads internally inside asynchronous_task. It is critical that you expose this parallelism internally, or your performance will not scale well.
To use nested OpenMP you can use something like this:
call omp_set_nested(.true.)   ! requires: use omp_lib

!$OMP parallel do NUM_THREADS(59)
do k = 1, NUM_JOBS
  call asynchronous_task( parameter_array(k) )
end do
!$OMP end parallel do

subroutine asynchronous_task( task_parameter )
  !$OMP parallel NUM_THREADS(4)
  call work()
  !$OMP end parallel
end subroutine
In both cases, you will need to utilise OpenMP inside your task subroutine in order to use more than one thread per task.
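If you prefer to keep the thread counts out of the source, the nested setup can also be configured through the environment instead of the NUM_THREADS clauses; OMP_NESTED and a per-level OMP_NUM_THREADS list are standard OpenMP controls (the comma-separated list form dates from OpenMP 3.0):

```shell
# enable nesting and request 59 threads at the outer level,
# 4 threads at the inner level
export OMP_NESTED=true
export OMP_NUM_THREADS=59,4
```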