
I am working on an OpenACC computational fluid dynamics code and trying to increase the granularity of the computation inside a loop by breaking the overall work into a bunch of small operations. My final goal is to reduce the number of registers per thread by splitting the original complex task into a series of smaller, simpler tasks on the GPU.

For instance, I have many formulas to compute for a specific node of the computational domain:

!$acc parallel loop ...
do i=1,n
  D1 = s(i+1,1) - s(i-1,1)
  D2 = s(i+1,2) - s(i-1,2)
  ...
  R = D1 + D2 + ...
enddo

As you can see, I can spread the computation across the threads of a block and, at the end, sum up the results into R by a reduction. Therefore, I defined an inner parallel loop as follows:

!$acc parallel loop 
do i=1,n
  !$acc parallel loop ...
  do j=1,m  
    D(j) = s(i+1,j) - s(i-1,j)
  enddo
  !$acc parallel loop reduction(+:R)
  do j=1,m
    R = R + D(j)
  enddo
enddo

However, I need D to be in shared memory, visible to all threads of the block, but I don't actually know the best way to do this in OpenACC (I used !$acc cache but got worse performance). I also need to place some unchanging data in constant memory, and again I don't know how.

Is there an efficient way to implement this idea in OpenACC? I really appreciate your help.

Thanks a lot, Behzad

behzad baghapour
  • One thing I didn't note in my reply, but you probably mean for the inner loops to be just `acc loop` rather than `acc parallel loop` as the former gives the compiler more information about the loop, while the latter creates a whole new, nested parallel region. – jefflarkin Sep 04 '15 at 20:21

2 Answers


I think what you're looking for is declaring the D array to be gang-private. If you were in C, I'd say that you can just declare it inside the i loop. Since you're in Fortran, try doing the following:

!$acc parallel loop private(D)
do i=1,n
  !$acc loop ...
  do j=1,m  
    D(j) = s(i+1,j) - s(i-1,j)
  enddo
  !$acc loop reduction(+:R)
  do j=1,m
    R = R + D(j)
  enddo
enddo

Once it's private to the gangs, you might actually find that the cache directive is more effective.
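
For example, an untested sketch of how private(D) and the cache directive might combine (assuming D is declared with m elements) would be:

!$acc parallel loop private(D)
do i=1,n
  ! stage the gang-private D in the fastest available memory
  ! (CUDA shared memory on NVIDIA GPUs) for this iteration of i
  !$acc cache(D(1:m))
  !$acc loop
  do j=1,m
    D(j) = s(i+1,j) - s(i-1,j)
  enddo
  !$acc loop reduction(+:R)
  do j=1,m
    R = R + D(j)
  enddo
enddo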

I was looking at a code doing something similar earlier today and you might also want to try expanding the size of D to be n x m and then splitting the i loop between the two j loops. Right now the compiler will need to insert synchronization between the two j loops, plus it will need to strip-mine the i loop into vectors, so you're probably losing M/vector_length parallelism. If you break it into two doubly-nested loops, you could collapse the i and j loops together and get more parallelism. That would look something like this.

! D is now the full (m,n) scratch array shared by both loop nests
!$acc parallel loop collapse(2)
do i=1,n
  do j=1,m
    D(j,i) = s(i+1,j) - s(i-1,j)
  enddo
enddo
!$acc parallel loop collapse(2) reduction(+:R)
do i=1,n
  do j=1,m
    R = R + D(j,i)
  enddo
enddo

The trade-off is that D will require more storage and the cache performance on the CPU will suffer. It may be worth experimenting though to see if the trade-off is worthwhile.
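
If you do go this route, one thing that may help (an untested sketch on my part, assuming D is declared with shape (m,n)) is making D a device-only scratch array with a data region, so the extra storage never travels between host and device:

! sketch: D lives only on the device (create), s is copied in once
R = 0.0
!$acc data create(D) copyin(s)
!$acc parallel loop collapse(2)
do i=1,n
  do j=1,m
    D(j,i) = s(i+1,j) - s(i-1,j)
  enddo
enddo
!$acc parallel loop collapse(2) reduction(+:R)
do i=1,n
  do j=1,m
    R = R + D(j,i)
  enddo
enddo
!$acc end data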

jefflarkin
  • Thank you very much, Jeff, for your helpful comments. I am working on implementing them in my code these days. Actually, the inner loop containing D is a kind of direction decomposition, D(x,y) = D(x) + D(y), where D(x) and D(y) can each be computed in parallel. Adding this idea to the code, I end up with three levels of loops (two sweeping the 2D computational domain and a third over the direction, of size 2 in a 2D simulation). I am wondering whether OpenACC can handle three nested loops of total size m*n*2 with three levels of parallelism on the GPU. If so, what would be the best mapping strategy? – behzad baghapour Sep 08 '15 at 19:15
  • I'd really need to see your 3 loops to comment, but triply (and more) nested loops are very common in OpenACC codes. It's generally best when you can use the collapse directive to give the compiler full freedom of how to map them to the GPU, but I've seen cases where the code operated very effectively without collapsing. If you can post a sample of how your code is operating on the 3 levels of loops, I'm happy to provide suggestions. – jefflarkin Sep 08 '15 at 20:40
  • Thanks. I added the code in the following Answer section. – behzad baghapour Sep 08 '15 at 21:58

The previous serial version of the code was like this:

do j=1,n
  do i=1,m
    a_x = s(i+1,j,1) - s(i-1,j,1)                   
    b_x = s(i+1,j,2) - s(i-1,j,2)                   
    c_x = s(i+1,j,3) - s(i-1,j,3)                  
    a_y = s(i,j+1,1) - s(i,j-1,1)                  
    b_y = s(i,j+1,2) - s(i,j-1,2)                  
    c_y = s(i,j+1,3) - s(i,j-1,3) 
    ...
    R1 = b_x + c_y
    R2 = a_x + a_y
    R3 = c_x + b_y
  enddo
enddo   

The new approach is to split the computations into the x and y directions as follows (serial version):

do j=1,n
  do i=1,m
    do k=1,2
      do l=1,3
        D(l) = s(i+dir(k,1),j+dir(k,2),l) - s(i-dir(k,1),j-dir(k,2),l)
      enddo             
      ...
      R_1(k) = a(k+1)
      R_2(k) = a(1)
      R_3(k) = a(3)
    enddo
    do k=1,2
      R1 = R1 + R_1(k)
      R2 = R2 + R_2(k)
      R3 = R3 + R_3(k)
    enddo
  enddo
enddo   

where dir(2,2) = {(1,0),(0,1)} determines the index shift in 2D for each direction k.

Now I am trying to port this code to the GPU with OpenACC.
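
My first attempt, following Jeff's advice, is to collapse the two domain loops and keep the short direction loop sequential inside each thread, with the small work arrays made private (an untested sketch; the array a(:) comes from the elided step above, and I assume s and dir are already on the device or copied in by the compiler):

!$acc parallel loop collapse(2) private(D, a, R_1, R_2, R_3)
do j=1,n
  do i=1,m
    ! the direction loop (k) runs sequentially within each thread
    do k=1,2
      do l=1,3
        D(l) = s(i+dir(k,1),j+dir(k,2),l) - s(i-dir(k,1),j-dir(k,2),l)
      enddo
      ...
      R_1(k) = a(k+1)
      R_2(k) = a(1)
      R_3(k) = a(3)
    enddo
    R1 = R_1(1) + R_1(2)
    R2 = R_2(1) + R_2(2)
    R3 = R_3(1) + R_3(2)
  enddo
enddo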

behzad baghapour