How to efficiently parallelize a linked list using OpenMP (using tasks?)

Question

This is a long post, with a lot of background before the question. The quick version: I tried to use OpenMP on the elements of a linked list, using OpenMP tasks in a way I've seen recommended elsewhere, but that leads to a significant slowdown. I can get a significant speedup if I divide the work up differently, but I'm wondering whether there's a way to get the first approach to work, since it's cleaner and simpler and (I think) it dynamically balances work across the threads.

I've got a reasonably long linked list (it can have a couple million elements) of Fortran derived types (C structures), and several times I have to iterate over the list and operate on each element. So I've got a subroutine (eachPhonon) that takes a subroutine as an argument (srt) and applies it to each element of the list:

subroutine eachPhonon(srt)
  external :: srt
  type(phonon), pointer :: tptr

  tptr => head

  do while(associated(tptr))
    call srt(tptr)
    tptr => tptr%next
  enddo
endsubroutine

It seems like this is a good place for a parallel speedup, since each call of srt can be done independently of the others. This would be very simple with OpenMP if I had a Fortran do (C for) loop. However, I've seen a method for doing it with a linked list, both on Stack Overflow and from Intel. Basically, it makes each call to srt its own task, something like:

subroutine eachPhonon(srt)
  external :: srt
  type(phonon), pointer :: tptr

  tptr => head

  !$OMP PARALLEL
  !$OMP SINGLE    
    do while(associated(tptr))
      !$OMP TASK FIRSTPRIVATE(tptr)
        call srt(tptr)
      !$OMP END TASK
      tptr => tptr%next
    enddo
  !$OMP END SINGLE
  !$OMP END PARALLEL
endsubroutine

This seems to work, but it's significantly slower than using just one thread.

I rewrote things so that given, say, 4 threads, one thread would operate on elements 1,5,9..., another on elements 2,6,10..., etc.:

subroutine everyNth(srt, tp, n)
  external :: srt

  type(phonon), pointer :: tp
  integer :: n, j

  do while(associated(tp))
    call srt(tp)

    do j=1,n
      if(associated(tp)) tp => tp%next
    enddo
  enddo
endsubroutine

subroutine eachPhononParallel(srt)
  use omp_lib
  external :: srt

  type(phonon), pointer :: tp
  integer :: j, nthreads

  !$OMP PARALLEL
  !$OMP SINGLE
    nthreads = OMP_GET_NUM_THREADS()
    tp => head
    do j=1,nthreads
      !$OMP TASK FIRSTPRIVATE(tp)
        call everyNth(srt, tp, nthreads)
      !$OMP END TASK
      ! Guard against lists shorter than nthreads
      if(associated(tp)) tp => tp%next
    enddo
  !$OMP END SINGLE
  !$OMP END PARALLEL
endsubroutine

This can lead to a significant speedup.

Is there a way to make the first method efficient?

I'm new to parallel processing, but my reading is that the first method has too much overhead because it creates a task for every element. The second way creates only one task per thread and avoids that overhead. The downside is somewhat less clean code that can't be compiled without OpenMP, and it won't dynamically balance work across the threads: everything is statically assigned at the beginning.

Massimiliano · Accepted Answer · 2013-06-11T18:06:05.767

If the granularity of your parallelism is too fine, you may try to operate on chunks of a bigger size:

subroutine eachPhonon(srt,chunksize)
  external            :: srt
  integer, intent(in) :: chunksize

  type(phonon), pointer :: tptr

  tptr => head

  !$OMP PARALLEL
  !$OMP SINGLE    
    do while(associated(tptr))
      !$OMP TASK FIRSTPRIVATE(tptr)
        ! Applies srt to at most chunksize elements, stopping
        ! early once tptr is no longer associated
        call chunk_srt(tptr,chunksize)
      !$OMP END TASK
      ! Advance tptr by up to chunksize elements, stopping at the end of the list
      call advance(tptr,chunksize)
    enddo
  !$OMP END SINGLE
  !$OMP END PARALLEL
endsubroutine

The idea is to set chunksize to a value big enough to mask the overhead that is associated with task creation.
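The two helpers used above are not spelled out, so here is a minimal sketch of how they might look. Everything in it is an assumption made for illustration: the module name phonon_chunks, the stand-in phonon type with an energy field, the phonon_op interface, and the choice to pass head and srt as explicit arguments (the code above reads head from an enclosing scope and does not pass srt to chunk_srt).

```fortran
module phonon_chunks
  implicit none

  ! Hypothetical stand-in for the question's phonon type.
  type phonon
    real :: energy = 0.0
    type(phonon), pointer :: next => null()
  end type phonon

  ! Assumed explicit interface for the per-element operation.
  abstract interface
    subroutine phonon_op(p)
      import :: phonon
      type(phonon), pointer :: p
    end subroutine phonon_op
  end interface

contains

  ! Apply srt to at most chunksize elements, starting at tptr.
  subroutine chunk_srt(srt, tptr, chunksize)
    procedure(phonon_op) :: srt
    type(phonon), pointer, intent(in) :: tptr
    integer, intent(in) :: chunksize
    type(phonon), pointer :: p
    integer :: i

    p => tptr
    do i = 1, chunksize
      if (.not. associated(p)) exit
      call srt(p)
      p => p%next
    enddo
  end subroutine chunk_srt

  ! Move tptr forward by at most chunksize elements.
  subroutine advance(tptr, chunksize)
    type(phonon), pointer, intent(inout) :: tptr
    integer, intent(in) :: chunksize
    integer :: i

    do i = 1, chunksize
      if (.not. associated(tptr)) exit
      tptr => tptr%next
    enddo
  end subroutine advance

  ! One task per chunk instead of one task per element.
  subroutine eachPhononChunked(head, srt, chunksize)
    type(phonon), pointer, intent(in) :: head
    procedure(phonon_op) :: srt
    integer, intent(in) :: chunksize
    type(phonon), pointer :: tptr

    tptr => head
    !$OMP PARALLEL
    !$OMP SINGLE
    do while (associated(tptr))
      !$OMP TASK FIRSTPRIVATE(tptr)
      call chunk_srt(srt, tptr, chunksize)
      !$OMP END TASK
      call advance(tptr, chunksize)
    enddo
    !$OMP END SINGLE
    !$OMP END PARALLEL
  end subroutine eachPhononChunked

end module phonon_chunks
```

Because the OpenMP directives are comments to a compiler that is not given an OpenMP flag, this sketch should also build and run serially. A suitable chunksize depends on how expensive srt is, so it is probably best found by timing a few values.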

score 2 · Answer 2 · answered Jun 11 '13 at 19:05

The slowdown means that srt() takes too little time to execute and therefore the overhead swamps the possible parallel speed-up. Besides Massimiliano's solution, you can also convert the linked list into an array of pointers and then use PARALLEL DO on the resultant structure:

type phononptr
  type(phonon), pointer :: p
endtype phononptr

...

subroutine eachPhonon(srt)
  external :: srt
  type(phonon), pointer :: tptr
  type(phononptr), dimension(:), allocatable :: ptrs
  integer :: i

  allocate(ptrs(numphonons))

  tptr => head
  i = 1

  do while(associated(tptr))
    ptrs(i)%p => tptr
    i = i + 1
    tptr => tptr%next
  enddo

  !$OMP PARALLEL DO SCHEDULE(STATIC)
  do i = 1, numphonons
    call srt(ptrs(i)%p)
  enddo
  !$OMP END PARALLEL DO

endsubroutine

If you do not explicitly keep the number of list items in a separate variable (numphonons in this case), you would have to traverse the list twice. The phononptr type is necessary because Fortran lacks an easier way to declare arrays of pointers.

The same can also be achieved by setting chunksize in Massimiliano's solution to numphonons / omp_get_num_threads().
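For the case where the list length is not tracked, the two-pass version could be sketched roughly as follows: one traversal to count, a second to fill the pointer array, then the parallel loop. The module name phonon_array, the stand-in phonon type, the phonon_op interface, the count_phonons helper, and passing head as an argument are all assumptions for illustration, not part of the code above.

```fortran
module phonon_array
  implicit none

  ! Hypothetical stand-in for the question's phonon type.
  type phonon
    real :: energy = 0.0
    type(phonon), pointer :: next => null()
  end type phonon

  ! Wrapper type, since Fortran has no direct array-of-pointers declaration.
  type phononptr
    type(phonon), pointer :: p => null()
  end type phononptr

  ! Assumed explicit interface for the per-element operation.
  abstract interface
    subroutine phonon_op(p)
      import :: phonon
      type(phonon), pointer :: p
    end subroutine phonon_op
  end interface

contains

  ! First traversal: count the list elements.
  function count_phonons(head) result(n)
    type(phonon), pointer, intent(in) :: head
    integer :: n
    type(phonon), pointer :: tptr

    n = 0
    tptr => head
    do while (associated(tptr))
      n = n + 1
      tptr => tptr%next
    enddo
  end function count_phonons

  subroutine eachPhononArray(head, srt)
    type(phonon), pointer, intent(in) :: head
    procedure(phonon_op) :: srt
    type(phononptr), allocatable :: ptrs(:)
    type(phonon), pointer :: tptr
    integer :: i, n

    n = count_phonons(head)
    allocate(ptrs(n))

    ! Second traversal: record a pointer to every element.
    tptr => head
    do i = 1, n
      ptrs(i)%p => tptr
      tptr => tptr%next
    enddo

    !$OMP PARALLEL DO SCHEDULE(STATIC)
    do i = 1, n
      call srt(ptrs(i)%p)
    enddo
    !$OMP END PARALLEL DO
  end subroutine eachPhononArray

end module phonon_array
```

Both extra traversals are serial, so this only pays off if srt does enough work per element to outweigh them. If the cost of srt varies a lot between elements, OpenMP's SCHEDULE(DYNAMIC, chunk) clause is one option to try in place of SCHEDULE(STATIC).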
