The taskloop construct by default has an implicit taskgroup around it. With that in mind, here is what happens in your code: the single construct picks one of the available threads of the parallel team (I'll call it the producer thread). The other n-1 threads are sent straight to the barrier at the end of the single construct, where they wait for work (the tasks) to arrive.
Because of the implicit taskgroup, the producer thread kicks off the creation of the loop tasks, but then waits at the end of the taskloop construct for all the created tasks to finish:
!$omp parallel
  !$omp single
    !$omp taskloop num_tasks(10)
    DO i=1, 10
      A(i) = foo()
    END DO
    !$omp end taskloop ! producer waits here for all loop tasks to finish
    ! do other stuff
    !$omp taskloop
    DO j=1, 10
      B(j) = A(j)
    END DO
    !$omp end taskloop ! producer waits here for all loop tasks to finish
  !$omp end single
!$omp end parallel
So, if the first taskloop creates fewer tasks than there are worker threads (n-1) waiting in the barrier, some of those threads will sit idle.
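If you want to see this on your own machine, a minimal sketch along the following lines records which thread ran each loop task. The program name and the executed_by array are made up for illustration, and the output depends on your runtime and thread count, so treat it as an experiment rather than a guarantee:

```fortran
program taskloop_idle_sketch
  use omp_lib
  implicit none
  integer, parameter :: ntasks = 10        ! same task count as in the example above
  integer :: i
  integer :: executed_by(ntasks)           ! hypothetical helper: thread id per iteration

  !$omp parallel
  !$omp single
  !$omp taskloop num_tasks(ntasks)
  do i = 1, ntasks
    ! each iteration is its own task here; remember which thread executed it
    executed_by(i) = omp_get_thread_num()
  end do
  !$omp end taskloop                        ! producer waits here (implicit taskgroup)
  !$omp end single
  !$omp end parallel

  print '(a, i0)', 'max threads: ', omp_get_max_threads()
  print '(a, *(i0, 1x))', 'executing thread per iteration: ', executed_by
end program taskloop_idle_sketch
```

Compiled with OpenMP enabled (for example gfortran -fopenmp) and run with more threads than tasks, at most ten distinct thread ids can show up in the list; the remaining threads had nothing to do.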
If you want more overlap, and if the "other stuff" is independent of the first taskloop, then you can do this:
!$omp parallel
  !$omp single
    !$omp taskgroup
      !$omp taskloop num_tasks(10) nogroup
      DO i=1, 10
        A(i) = foo()
      END DO
      !$omp end taskloop ! producer will not wait for the loop tasks to complete
      ! do other stuff
    !$omp end taskgroup ! wait for the loop tasks (and their descendant tasks)
    !$omp taskloop
    DO j=1, 10
      B(j) = A(j)
    END DO
    !$omp end taskloop
  !$omp end single
!$omp end parallel
Alas, as of version 5.1 the OpenMP API does not support task dependences on the taskloop construct, so you cannot easily describe the dependence between the loop iterations of the first taskloop and those of the second taskloop. The OpenMP language committee is working on this right now, but I expect it to land in version 6.0 of the OpenMP API rather than in version 5.2.
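Until then, one way to spell out a per-element dependence is to fall back to explicit task constructs with depend clauses instead of taskloop. This is only a sketch of that alternative, not a drop-in replacement: it creates one task per iteration, so you lose the chunking that taskloop gives you, and the body of foo below is a hypothetical stand-in for your real function:

```fortran
program task_depend_sketch
  implicit none
  integer, parameter :: n = 10
  integer :: i, j
  real :: A(n), B(n)

  !$omp parallel
  !$omp single
  do i = 1, n
    ! producer task for element i
    !$omp task depend(out: A(i)) firstprivate(i) shared(A)
    A(i) = foo()
    !$omp end task
  end do
  ! independent "other stuff" could run here on the producer thread
  do j = 1, n
    ! consumer task for element j; it may start as soon as A(j) has been written
    !$omp task depend(in: A(j)) firstprivate(j) shared(A, B)
    B(j) = A(j)
    !$omp end task
  end do
  !$omp end single   ! implicit barrier: all tasks are complete after this point
  !$omp end parallel

  print *, B
contains
  ! hypothetical stand-in for the foo() of the original code
  real function foo()
    foo = 1.0
  end function foo
end program task_depend_sketch
```

With this shape a consumer task only waits for the one element it reads, rather than for the whole first loop, at the price of more and smaller tasks.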
PS (EDIT): Since the second taskloop sits right before the end of the single construct, and thus right before a barrier, you can add nogroup there as well to spare the producer thread that extra bit of waiting.
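Put together, a complete version could look roughly like this. It is a sketch under a few assumptions: the array declarations, the program name, and the body of foo are made up so that the example compiles on its own:

```fortran
program taskloop_nogroup_sketch
  implicit none
  integer, parameter :: n = 10
  integer :: i, j
  real :: A(n), B(n)

  !$omp parallel
  !$omp single
  !$omp taskgroup
  !$omp taskloop num_tasks(10) nogroup
  do i = 1, n
    A(i) = foo()
  end do
  !$omp end taskloop      ! producer does not wait here
  ! independent "other stuff" goes here
  !$omp end taskgroup     ! all tasks of the first loop are done after this point
  !$omp taskloop nogroup
  do j = 1, n
    B(j) = A(j)
  end do
  !$omp end taskloop      ! producer does not wait here either
  !$omp end single        ! the implicit barrier waits for the remaining tasks
  !$omp end parallel

  print *, B
contains
  ! hypothetical stand-in for the foo() of the original code
  real function foo()
    foo = 1.0
  end function foo
end program taskloop_nogroup_sketch
```

The explicit taskgroup around the first loop is still what guarantees that A is fully written before the second loop starts; nogroup on the second loop only removes the redundant wait in front of the barrier.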