I am working on an OpenACC computational fluid dynamics code and want to increase the granularity of the computations inside a loop by breaking the overall computation down into a bunch of smaller operations. My final goal is to reduce the number of registers per thread by splitting the original complex task into a series of smaller, simpler tasks on the GPU.
For instance, I have many formulas to compute for a specific node of the computational domain:
!$acc parallel loop ...
do i = 1, n
   D1 = s(i+1,1) - s(i-1,1)
   D2 = s(i+1,2) - s(i-1,2)
   ...
   R = D1 + D2 + ...
enddo
As you can see, I can spread this computation across the threads of a block and, at the end, sum the partial results into R with a reduction. Therefore, I defined an inner parallel loop as follows:
!$acc parallel loop
do i = 1, n
   !$acc parallel loop ...
   do j = 1, m
      D(j) = s(i+1,j) - s(i-1,j)
   end do
   !$acc parallel loop reduction(+:R)
   do j = 1, m
      R = R + D(j)
   end do
enddo
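From what I have read, I suspect the mapping I am actually after is a gang loop over i with vector loops over j, and D made private at the gang level, roughly like the sketch below. The gang/vector clauses and the private(D) placement are my guesses; I have not verified that this is correct or that it puts D where I want it:

!$acc parallel loop gang private(D)   ! one gang (block) per i; each gang gets its own D
do i = 1, n
   !$acc loop vector                  ! spread the j work over the vector lanes (threads)
   do j = 1, m
      D(j) = s(i+1,j) - s(i-1,j)
   end do
   R = 0.0
   !$acc loop vector reduction(+:R)   ! sum the partial results into R
   do j = 1, m
      R = R + D(j)
   end do
enddo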
However, I need D to be in shared memory, visible to all the threads working on one i, but I don't actually know the best way to express that in OpenACC (I tried !$acc cache, but I got worse performance). I also need to place some data that never changes into constant memory, and again I don't know how to do that.
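For reference, this is roughly where I placed the cache directive, together with my current guess for the unchanging data (the names coef and nc are just examples):

! schematic of my cache attempt: request D at the top of the i-loop body
!$acc parallel loop
do i = 1, n
   !$acc cache(D(1:m))     ! hoped this would keep D in shared memory; it ran slower instead
   ! ... the two j loops exactly as above ...
enddo

! unchanging data: my best guess so far, but I don't know whether declare copyin
! ends up in constant memory or just in ordinary global device memory
real :: coef(nc)
!$acc declare copyin(coef)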
Is there an efficient way to implement this idea in OpenACC? I really appreciate your help.
Thanks a lot, Behzad