
I have some code similar to this:

!$omp parallel do
do k = 1, NUM_JOBS
  call asynchronous_task( parameter_array(k) )
end do
!$omp end parallel do

I've tried many different strategies, including

$ micnativeloadex $exe -e "KMP_PLACE_THREADS=59Cx4T OMP_NUM_THREADS=236"

But, when I check the MIC with top, I'm only getting 25% usage.

I'm having a great deal of difficulty finding any specific help in the Intel docs/forums and OpenMP forums, and now I'm thinking that my only shot at having 59 tasks with 4 threads working on each task is to combine MPI with OpenMP.

Does anyone have experience with this, or any recommendations for moving forward? I've been running 236 asynchronous tasks instead, but I suspect that 59 tasks will run over 4 times faster than 236 because of the memory overhead of each task.

Jon Clements
jyalim
  • Are you sure that your 25% usage is because you are not getting all 236 threads you are asking for? Do you have access to VTune, to see if you are getting a lot of cache misses? Running a large number of asynchronous tasks is probably not getting you much cache reuse between threads, meaning you might get better performance with 1 or two threads per core. If you can provide some more information, I'll see if I can find someone to give you more help. – froth Nov 10 '14 at 19:45

1 Answer


KMP_PLACE_THREADS will set OMP_NUM_THREADS implicitly, so you don't need to specify OMP_NUM_THREADS in your MIC environment variables.
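For instance, your earlier command could likely be reduced to the placement variable alone (a sketch of the invocation, not verified on your setup):

```
# 59 cores x 4 threads per core => OMP_NUM_THREADS=236 is implied
micnativeloadex $exe -e "KMP_PLACE_THREADS=59c,4t"
```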

If you would like to use 59 tasks with 4 threads per task you have a few options.

MPI/OpenMP

As you mentioned, you could use a hybrid MPI/OpenMP approach. In this case each rank gets its own OpenMP domain. I have achieved this in the past by running mpirun natively on the MIC, something like this:

#!/bin/bash
export I_MPI_PIN=off
mpirun -n 1 -env KMP_PLACE_THREADS=10c,4t,1o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,11o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,21o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,31o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,41o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,51o ./scaling

This will create 6 MPI ranks, each with 40 OpenMP threads (10 cores x 4 threads), with the ranks' threads explicitly placed starting at core offsets 1, 11, 21, 31, 41 and 51.

You will have to design your MPI code to split the NUM_JOBS over your ranks and use OpenMP internally inside your asynchronous_task().

Nested OpenMP

The other possibility is to use nested OpenMP. This will almost certainly be more advantageous for total memory consumption on the Xeon Phi.

In this case, you will also need to expose parallelism inside your asynchronous_task using OpenMP directives.

At the top-level loop you can start 59 tasks and then use 4 threads internally in asynchronous_task. It is critical that you can expose this parallelism internally, or your performance will not scale well.

To use nested OpenMP you can use something like this:

call omp_set_nested(.true.)

!$OMP parallel do NUM_THREADS(59)
do k = 1, NUM_JOBS
  call asynchronous_task( parameter_array(k) )
end do
!$OMP end parallel do

subroutine asynchronous_task( param )
!$OMP parallel NUM_THREADS(4)
   call work( param )
!$OMP end parallel
end subroutine

In both use cases, you will need to utilise OpenMP inside your task subroutine, in order to use more than one thread per task.

amckinley