
I am attempting to implement OpenACC in some Fortran code I have. The code consists of an outer time stepping loop (which cannot be parallelized); within that loop there are a number of nested loops. These nested loops can be parallelized individually, but they need to run in order (i.e. A followed by B followed by C).

I want to offload this entire process to the GPU, since transferring data between the CPU and GPU on every timestep becomes a prohibitive penalty over many timesteps. The pseudocode below illustrates my current approach:

!$acc data copy(ALL THE DATA THAT I NEED)
DO iter = 1, NT
    value = array_of_values(iter)
    !$acc kernels
!PART A
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmaxput
            !$acc loop independent, private(l)
            DO L = 0, zmax
                if(value == 0) then
                    (DO SOME COMPUTATIONS...)
                elseif(value < 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                elseif(value > 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                endif
            ENDDO
        ENDDO
    ENDDO

!NOW GO DO OTHER STUFF
!PART B
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmax
            !$acc loop independent, private(l)
            DO L = 0, zmax
                (DO SOME EVEN MORE COMPUTATIONS...)
            ENDDO
        ENDDO
    ENDDO

!PART C
!etc...

    !$acc end kernels
ENDDO
!$acc end data

I have working code using this approach; however, when I profile it on a GeForce MX150 GPU using NVIDIA's Visual Profiler, I see that every iteration of the time stepping loop contains a large gap during which no computation is being done. During this time the Driver API timeline shows "cuLaunchKernel". If I duplicate the loop body so that two iterations' worth of work runs per timestep, the gap disappears within the time stepping loop and only appears when the loop begins.

I have a few (interrelated) questions:
1. Is there a way to get these kernels to be launched while other kernels are running?
2. I've read here and here that the WDDM driver batches kernel launches, which appears to be what is happening here. Does this mean that if I were to run on Linux I should not expect this behavior?

cuStreamSynchronize also appears to block the GPU from running, leading to additional idle time. This seems related to the question of how to get other kernels to launch before the end of the time stepping loop.

This is my first time using OpenACC. I have looked all over for an answer to this, but am probably using the wrong keywords as I have not been able to find anything.

EDIT - solution

Per Mat's suggestion, I added async, which solved the issue. Interestingly, the kernel launches are still batched together, but now every kernel that will be launched while iterating through the time stepping loop is launched at once at the beginning of the program. The updated pseudocode is below, along with a few other tweaks, should it ever be helpful to anyone else:

!$acc data copy(ALL THE DATA THAT I NEED)
!$acc wait
DO iter = 1, NT
    value = array_of_values(iter)
    !$acc update device(value, iter), async(1)    !queue the copy of value/iter to the GPU so the host loop can continue

!PART A
    !$acc kernels, async(1)
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmaxput
            !$acc loop independent, private(l)
            DO L = 0, zmax
                if(value == 0) then
                    (DO SOME COMPUTATIONS...)
                elseif(value < 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                elseif(value > 0) then
                    (DO SOME OTHER COMPUTATIONS...)
                endif
            ENDDO
        ENDDO
    ENDDO
    !$acc end kernels

!NOW GO DO OTHER STUFF
!PART B
    !$acc kernels, async(1)
    !$acc loop independent, private(j)
    DO J = 0, ymax
        !$acc loop independent, private(i)
        DO I = 0, xmax
            !$acc loop independent, private(l)
            DO L = 0, zmax
                (DO SOME EVEN MORE COMPUTATIONS...)
            ENDDO
        ENDDO
    ENDDO
    !$acc end kernels

!PART C
    !$acc kernels, async(1)
        !for loops, etc...
    !$acc end kernels
ENDDO
!$acc wait
!$acc end data
Noel

2 Answers


When you say "large time gap", can you be more specific? Are you talking seconds, microseconds, milliseconds? While it varies a lot, I would expect kernel launch overhead to be around 40 microseconds. Often launch overhead gets lost in the noise, but if the kernels are particularly fast, or if a kernel is launched millions of times, it can affect the relative performance. Using "async" clauses can help to hide the launch overhead (see below).

Though if the gaps are much larger, then there might be something else going on. For example, if there's a reduction in a loop, the reduction variable may be getting copied back to the host after each kernel. If you're using PGI, take a look at the compiler feedback messages (-Minfo=accel); they may give some clues as to what's going on.
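For example (a toy program with made-up names, not the question's code), a scalar reduction like the one below returns its result to the host when the kernels region ends, adding a device-to-host transfer on every pass:

! Toy example: the reduction result must be returned to the host
! when the kernels region ends, adding a small transfer each pass.
program reduction_demo
    implicit none
    integer, parameter :: n = 100000
    real :: a(n), total
    integer :: i

    a = 1.0
    total = 0.0
    !$acc kernels copyin(a)
    !$acc loop independent reduction(+:total)
    DO i = 1, n
        total = total + a(i)
    ENDDO
    !$acc end kernels
    print *, 'total =', total
end program reduction_demo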

  1. Is there a way to get these kernels to be launched while other kernels are running?

Yes. Use three separate "kernels" regions, one for each part, and add an "async(1)" clause to each compute region. Async lets the host continue after launching a kernel, and since all three regions use the same queue (1 in this case, but any positive integer works), the queue creates a dependency: B won't run until A is finished, and C will start after B. You'll want to add a "!$acc wait" wherever you want the host to synchronize with the device.
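As a minimal sketch of the pattern (a toy program; the arrays and loop bodies are placeholders, not the question's actual code):

! Toy demo: chaining dependent compute regions on one async queue.
program async_queue_demo
    implicit none
    integer, parameter :: n = 100000, nt = 1000
    real :: a(n), b(n), c(n)
    integer :: i, iter

    a = 1.0; b = 0.0; c = 0.0
    !$acc data copy(a, b, c)
    DO iter = 1, nt
        !$acc kernels async(1)          ! Part A: host continues after the launch
        !$acc loop independent
        DO i = 1, n
            b(i) = a(i) + 1.0
        ENDDO
        !$acc end kernels

        !$acc kernels async(1)          ! Part B: same queue, so it waits for A
        !$acc loop independent
        DO i = 1, n
            c(i) = 2.0 * b(i)
        ENDDO
        !$acc end kernels
    ENDDO
    !$acc wait                          ! host blocks here until queue 1 drains
    !$acc end data
    print *, 'c(1) =', c(1)             ! 4.0 once everything has finished
end program async_queue_demo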

Note that under the hood, async queues map to CUDA streams.

cuStreamSynchronize also appears to block the GPU from running, leading to additional idle time. This seems related to the question of how to get other kernels to launch before the end of the time stepping loop.

This is the time the host spends blocked waiting for the GPU compute to finish. It should be about the same as your kernel run time (when not using async).
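To make that concrete (a toy snippet with a made-up array, not the question's code), without async the host blocks at the end of every compute region, while with async it blocks only at the explicit wait:

! Toy comparison: where the host blocks with and without async.
program wait_demo
    implicit none
    integer, parameter :: n = 100000
    real :: a(n)
    integer :: i
    a = 0.0

    ! Synchronous: the host blocks (this is the cuStreamSynchronize time)
    ! at the end of the region, so the cost recurs for every region.
    !$acc kernels copy(a)
    !$acc loop independent
    DO i = 1, n
        a(i) = a(i) + 1.0
    ENDDO
    !$acc end kernels

    ! Asynchronous: the launch returns immediately; the host blocks
    ! only at the explicit wait.
    !$acc kernels copy(a) async(1)
    !$acc loop independent
    DO i = 1, n
        a(i) = a(i) + 1.0
    ENDDO
    !$acc end kernels
    !$acc wait(1)

    print *, 'a(1) =', a(1)   ! 2.0
end program wait_demo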

Mat Colgrove
  • The gap was about 100 microseconds, which was about 20% of each timestep, and considering there are thousands of timesteps it added up quickly. Using Async solved the issue. Thanks! – Noel Jan 26 '19 at 03:28

You can estimate what difference WDDM makes by running your GPU without a display. If no displays are connected to the GPU, some of the problems of running under WDDM go away.

Try connecting your display(s) to your mainboard, if it supports running the integrated graphics and the GPU simultaneously (check your BIOS).

Otherwise, you could add another GPU to your PC (maybe you have an old one lying around) and use that one for your displays.

Daniel Bauer
  • The GPU doesn't automatically switch from WDDM to TCC mode if no display is connected. Use nvidia-smi to change the mode from WDDM to TCC, provided your GPU supports it (Tesla cards do). – tera Jan 25 '19 at 09:39
  • @tera it's not a Tesla, it's a GeForce card. – Daniel Bauer Jan 25 '19 at 09:44
  • In that case TCC is unfortunately not an option. But switching to Linux is. – tera Jan 25 '19 at 09:48
  • The solution I posted removed similar issues for me, so maybe it will work here too. – Daniel Bauer Jan 25 '19 at 09:51
  • Having a display connected might introduce additional periodic delays when the GPU is busy drawing screen content, and those can certainly be avoided by disconnecting the display. But the driver will still remain in WDDM mode and incur the related penalties, which could be avoided by switching to TCC mode or Linux. – tera Jan 25 '19 at 10:03
  • I edited my answer. It's still worth testing to see whether switching to Linux pays off. – Daniel Bauer Jan 25 '19 at 10:15
  • I am going to switch eventually, just testing the code on my laptop now but debugging on the server I will be using is a pain. It looks like I was able to solve the issue using the Async option recommended by Mat. – Noel Jan 26 '19 at 03:30