I am attempting to add OpenACC to some Fortran code I have. The code consists of an outer time-stepping loop (which cannot be parallelized); within that loop are a number of nested loops. These nested loops can be parallelized, but they must run in order (i.e. A, then B, then C).
I want to offload this entire process to the GPU, since transferring data between the GPU and CPU on every timestep becomes a prohibitive cost. The pseudocode below illustrates my current approach:
!$acc data copy(ALL THE DATA THAT I NEED)
DO iter = 1, NT
    value = array_of_values(iter)
    !$acc kernels
    !PART A
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmaxput
            !$acc loop independent, private(l)
            DO L = 0, zmax
                if (value == 0) then
                    (DO SOME COMPUTATIONS...)
                elseif (value < 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                elseif (value > 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                endif
            ENDDO
        ENDDO
    ENDDO
    !NOW GO DO OTHER STUFF
    !PART B
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmax
            !$acc loop independent, private(l)
            DO L = 0, zmax
                (DO SOME EVEN MORE COMPUTATIONS...)
            ENDDO
        ENDDO
    ENDDO
    !PART C
    !etc...
    !$acc end kernels
ENDDO
!$acc end data
I have working code using this approach. However, when I profile it on a GeForce MX150 GPU with NVIDIA's Visual Profiler, I see that every iteration of the time-stepping loop produces large gaps in which no computation is being done; the Driver API timeline shows "cuLaunchKernel" calls during these gaps. If I duplicate the loop body so that two iterations' worth of work runs per timestep, the gap disappears within the time-stepping loop and only appears when the loop begins.
I have a few (interrelated) questions:
1. Is there a way to get these kernels to be launched while other kernels are running?
2. I've read here and here that the WDDM driver batches kernel launches, which appears to be what is happening here. Does this mean that if I were to run on Linux, I should not expect this behavior?
cuStreamSynchronize also appears to block the GPU from running, leading to additional idle time. This seems related to the question of how to get other kernels to launch before the end of the time-stepping loop.
This is my first time using OpenACC. I have searched all over for an answer to this, but I am probably using the wrong keywords, as I have not been able to find anything.
EDIT - solution
Per Mat's suggestion, I added async, which solved the issue. Interestingly, the kernel launches are still batched together, but now every kernel that will be launched during the time-stepping loop is launched at once at the beginning of the program. The updated pseudocode is below, along with a few other tweaks, should it ever be helpful to anyone else:
!$acc data copy(ALL THE DATA THAT I NEED)
!$acc wait
DO iter = 1, NT
    value = array_of_values(iter)
    !$acc update device(value, iter), async(1) ! lets the host loop keep running while value is sent to the GPU
    !PART A
    !$acc kernels, async(1)
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmaxput
            !$acc loop independent, private(l)
            DO L = 0, zmax
                if (value == 0) then
                    (DO SOME COMPUTATIONS...)
                elseif (value < 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                elseif (value > 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                endif
            ENDDO
        ENDDO
    ENDDO
    !$acc end kernels
    !NOW GO DO OTHER STUFF
    !PART B
    !$acc kernels, async(1)
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmax
            !$acc loop independent, private(l)
            DO L = 0, zmax
                (DO SOME EVEN MORE COMPUTATIONS...)
            ENDDO
        ENDDO
    ENDDO
    !$acc end kernels
    !PART C
    !$acc kernels, async(1)
    !for loops, etc...
    !$acc end kernels
ENDDO
!$acc wait
!$acc end data
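In case it helps anyone, here is a minimal, self-contained sketch of the same single-queue async pattern. The array names, sizes, and computations are placeholders, not my actual code; the point is that queuing every kernels region on the same async queue (1) preserves the A-then-B ordering on the device while freeing the host to keep launching ahead:

! Minimal sketch of the async pattern (placeholder arrays a and b).
! Compile with an OpenACC-enabled compiler, e.g. nvfortran -acc example.f90
program async_demo
    implicit none
    integer, parameter :: n = 1024, nt = 100
    real :: a(n), b(n)
    integer :: iter, i

    a = 1.0
    b = 0.0

    !$acc data copy(a, b)
    DO iter = 1, nt
        ! "Part A" and "Part B" go on the same async queue (1), so they
        ! still execute in order on the device, but the host does not
        ! wait for each kernel to finish before launching the next.
        !$acc kernels, async(1)
        !$acc loop independent
        DO i = 1, n
            b(i) = a(i) + 1.0
        ENDDO
        !$acc end kernels

        !$acc kernels, async(1)
        !$acc loop independent
        DO i = 1, n
            a(i) = b(i) * 0.5
        ENDDO
        !$acc end kernels
    ENDDO
    !$acc wait(1) ! block until all work queued on queue 1 has completed
    !$acc end data
end program async_demo

A single host-side wait at the end (or wherever results are actually needed) replaces the implicit synchronization after every kernels region.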