OPENMP F90/95 Nested DO loops - problems getting improvement over serial implementation

Question

I've done some searching but couldn't find anything that appeared to be related to my question (sorry if my question is redundant!). Anyway, as the title states, I'm having trouble getting any improvement over the serial implementation of my code. The code snippet that I need to parallelize is as follows (this is Fortran90 with OpenMP):

do n=1,lm     
  do m=1,jm   
    do l=1,im      
      sum_u = 0
      sum_v = 0
      sum_t = 0
      do k=1,lm
       !$omp parallel do reduction (+:sum_u,sum_v,sum_t) 
        do j=1,jm  
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u = sum_u + u_p(i,j,k) * exp_smoother
            sum_v = sum_v + v_p(i,j,k) * exp_smoother
            sum_t = sum_t + t_p(i,j,k) * exp_smoother

            sum_u_pert(l,m,n) = sum_u
            sum_v_pert(l,m,n) = sum_v
            sum_t_pert(l,m,n) = sum_t          

            end do
          end do
       end do      
    end do
  end do  
end do

Am I running into race condition issues? Or am I simply putting the directive in the wrong place? I'm pretty new to this, so I apologize if this is an overly simplistic problem.

Anyway, without parallelization, the code is excruciatingly slow. To give an idea of the size of the problem, the lm, jm, and im indexes are 60, 401, and 501 respectively. So the parallelization is critical. Any help or links to helpful resources would be very much appreciated! I'm using xlf to compile the above code, if that's at all useful.

Thanks! -Jen

score 3 · Accepted Answer · answered Jun 06 '11 at 22:01

The obvious place to put the omp pragma is at the very outside loop.

For every (l,m,n), you're calculating a convolution between your perturbed variables and an exponential smoother. Each (l,m,n) calculation is completely independant from the others, so you can put it on the outermost loop. So for instance the simplest thing

!$omp parallel do private(n,m,l,i,j,k,exp_smoother) shared(sum_u_pert,sum_v_pert,sum_t_pert,u_p,v_p,t_p), default(none)
do n=1,lm
  do m=1,jm
    do l=1,im
      do k=1,lm
        do j=1,jm
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u_pert(l,m,n) = sum_u_pert(l,m,n) + u_p(i,j,k) * exp_smoother
            sum_v_pert(l,m,n) = sum_v_pert(l,m,n) + v_p(i,j,k) * exp_smoother
            sum_t_pert(l,m,n) = sum_t_pert(l,m,n) + t_p(i,j,k) * exp_smoother
          end do
        end do
      end do
    end do
  end do
end do

gives me a ~6x speedup on 8 cores (using a much reduced problem size of 20x41x41). Given the amount of work there is to do in the loops, even at the smaller size, I assume the reason it's not an 8x speedup involves memory contension or false sharing; for further performance tuning you might want to explicitly break the sum arrays into sub-blocks for each thread, and combine them at the end; but depending on the problem size, having the equivalent of an extra im x jm x lm sized array might not be desirable.

It seems like there's a lot of structure in this problem you aught to be able to explot to speed up even the serial case, but it's easier to say that then to find it; playing around on pen and paper nothing comes to mind in a few minutes, but someone cleverer may spot something.

Thanks! This was very helpful. I'm giving this a try at the moment. Unfortunately I still have a ways to go to get the performance level that I need. I have access to a pretty big machine (64 cores per processor) - but perhaps I may need to combine this with MPI to get the necessary performance. — JRC, Jun 06 '11 at 22:40
The downside of MPI here would likely be extra copies of some of the arrays, but the calculations are so independant that, if the decomposition across MPI makes sense for the rest of your problem, it should be pretty easy. xlf + 64 cores == Power system? Make sure you're using -O5, -qhot, and the appropriate -qarch flag and let the compiler really go to town on this one. With 64 cores and the outermost loop only having ~60 iterations, you may find it advantageous to add a collapse(2) or collapse(3) clause to the omp do line. — Jonathan Dursi, Jun 06 '11 at 23:50
You're right on the money with the Power system. I'll give the collapse clauses a shot tomorrow in addition to the compiler flags. I also realized it's unnecessary to be doing the calculation across the entire grid (for my problem) - so that should speed things up considerably as well. Thanks again! I had no idea stack overflow was such a great resource! — JRC, Jun 07 '11 at 03:15

deprecated · Answer 2 · 2011-06-07T02:23:04.637

2

What you have is a convolution. This can be done with a Fast Fourier Transform in N log2(N) time. Your algorithm is N^2. If you use FFT, one core will probably be enough!

edited Jun 07 '11 at 02:23

answered Jun 06 '11 at 23:59

deprecated

2,030
16
11

OPENMP F90/95 Nested DO loops - problems getting improvement over serial implementation

2 Answers2

Linked