This is a long post, with a lot of background before the question. The quick version: I tried to parallelize work over the elements of a linked list using OpenMP tasks, in a way I've seen recommended elsewhere, but it leads to a significant slowdown. I can get a significant speedup if I divide the work differently, but I'm wondering whether there's a way to make the first approach work, since it's cleaner and simpler and (I think) it balances work dynamically across the threads.
I've got a reasonably long linked list (it can have a couple million elements) of Fortran derived types (the equivalent of C structs), and several times I have to iterate over the list and operate on each element. So I've got a subroutine (eachPhonon) that takes another subroutine (srt) as an argument and calls it on each element of the list:
subroutine eachPhonon(srt)
  external :: srt
  type(phonon), pointer :: tptr

  tptr => head
  do while (associated(tptr))
    call srt(tptr)
    tptr => tptr%next
  enddo
end subroutine
It seems like this is a good place for a parallel speedup, since each call of srt can be done independently of the others. This would be very simple with OpenMP if I had a Fortran do loop (a C for loop). However, I've seen a method for doing it with a linked list, both on Stack Overflow and from Intel. Basically, it makes each call to srt its own task, something like:
subroutine eachPhonon(srt)
  external :: srt
  type(phonon), pointer :: tptr

  tptr => head
  !$OMP PARALLEL
  !$OMP SINGLE
  do while (associated(tptr))
    !$OMP TASK FIRSTPRIVATE(tptr)
    call srt(tptr)
    !$OMP END TASK
    tptr => tptr%next
  enddo
  !$OMP END SINGLE
  !$OMP END PARALLEL
end subroutine
This seems to work, but it's significantly slower than using just one thread.
I rewrote things so that, given say 4 threads, one thread operates on elements 1, 5, 9, ..., another on elements 2, 6, 10, ..., and so on:
subroutine everyNth(srt, tp, n)
  external :: srt
  type(phonon), pointer :: tp
  integer :: n, j

  do while (associated(tp))
    call srt(tp)
    do j = 1, n
      if (associated(tp)) tp => tp%next
    enddo
  enddo
end subroutine
subroutine eachPhononParallel(srt)
  use omp_lib
  external :: srt
  type(phonon), pointer :: tp
  integer :: j, nthreads

  !$OMP PARALLEL
  !$OMP SINGLE
  nthreads = OMP_GET_NUM_THREADS()
  tp => head
  do j = 1, nthreads
    !$OMP TASK FIRSTPRIVATE(tp)
    call everyNth(srt, tp, nthreads)
    !$OMP END TASK
    ! Guard against a list shorter than the number of threads
    if (associated(tp)) tp => tp%next
  enddo
  !$OMP END SINGLE
  !$OMP END PARALLEL
end subroutine
This can lead to a significant speedup.
Is there a way to make the first method efficient?
I'm new to parallel processing, but my understanding is that the first method has too much overhead because it creates a task for every element, and with a couple million elements the cost of creating and scheduling tasks can outweigh the work done in each call to srt. The second way creates only one task per thread and avoids that overhead. The downside is somewhat less clean code that can't be compiled without OpenMP (it uses omp_lib and OMP_GET_NUM_THREADS), and it won't dynamically balance work across the threads, since everything is statically assigned at the beginning.
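To make the trade-off concrete, here is a minimal sketch of a possible middle ground between the two versions: one task per chunk of consecutive elements, rather than one per element or one per thread. Everything here is made up for illustration: the stripped-down phonon type, the doubleVal callback, the names runChunk and eachPhononChunked, and the chunk size of 4. It also assumes srt never modifies the next links. I haven't measured it on the real list, so treat it as something to try rather than a known fix.

```fortran
module phonon_list
  implicit none

  ! Hypothetical stand-in for the real phonon type.
  type phonon
    real :: val = 0.0
    type(phonon), pointer :: next => null()
  end type phonon

  type(phonon), pointer :: head => null()

  ! Explicit interface for the callback, since it takes a pointer dummy.
  abstract interface
    subroutine phonon_op(p)
      import :: phonon
      type(phonon), pointer :: p
    end subroutine phonon_op
  end interface

contains

  ! Example callback (assumption): doubles the stored value.
  subroutine doubleVal(p)
    type(phonon), pointer :: p
    p%val = 2.0 * p%val
  end subroutine doubleVal

  ! Apply srt to at most n consecutive elements starting at first.
  subroutine runChunk(srt, first, n)
    procedure(phonon_op) :: srt
    type(phonon), pointer :: first
    integer, intent(in) :: n
    type(phonon), pointer :: p
    integer :: i
    p => first
    do i = 1, n
      if (.not. associated(p)) exit
      call srt(p)
      p => p%next
    end do
  end subroutine runChunk

  ! One task per chunk of elements instead of one task per element.
  subroutine eachPhononChunked(srt, chunk)
    procedure(phonon_op) :: srt
    integer, intent(in) :: chunk
    type(phonon), pointer :: tptr
    integer :: i

    tptr => head
    !$OMP PARALLEL
    !$OMP SINGLE
    do while (associated(tptr))
      !$OMP TASK FIRSTPRIVATE(tptr)
      call runChunk(srt, tptr, chunk)
      !$OMP END TASK
      ! Skip ahead to the first element of the next chunk.
      do i = 1, chunk
        if (.not. associated(tptr)) exit
        tptr => tptr%next
      end do
    end do
    !$OMP END SINGLE
    !$OMP END PARALLEL
  end subroutine eachPhononChunked

end module phonon_list

program demo
  use phonon_list
  implicit none
  type(phonon), pointer :: p
  integer :: k
  real :: total

  ! Build a 10-element list with val = 1.0 everywhere.
  do k = 1, 10
    allocate(p)
    p%val = 1.0
    p%next => head
    head => p
  end do

  call eachPhononChunked(doubleVal, 4)

  ! Each element is visited exactly once, so the sum should be 20.
  total = 0.0
  p => head
  do while (associated(p))
    total = total + p%val
    p => p%next
  end do
  print *, total
end program demo
```

With gfortran this can be built with -fopenmp. Since it doesn't use omp_lib, it also compiles without that flag, in which case the directives are treated as comments and the loop runs serially. The chunk size is the tuning knob: larger chunks mean fewer tasks and less overhead, while smaller chunks give the runtime more freedom to balance work across threads.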