
I'm somewhat at a loss over the performance I'm getting with OpenCL on an AMD GPU (Hawaii core, i.e. Radeon R9 390).

The operation is as follows:

  • send memory object #1 to GPU
  • execute kernel #1
  • send memory object #2 to GPU
  • execute kernel #2
  • send memory object #3 to GPU
  • execute kernel #3

The dependencies are:

  • kernel #1 on memory object #1
  • kernel #2 on memory object #2 as well as output memory of kernel #1
  • kernel #3 on memory object #3 as well as output memory of kernels #1 & #2

Memory transfers and kernel executions are performed in two separate command queues. Command dependencies are expressed with OpenCL events, as defined in the standard.

The whole operation is now looped just for performance analysis with the same input data.
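
A simplified sketch of the per-iteration enqueue pattern (not my full code: context, queues, buffers and kernels are assumed to exist already, and the buffer/kernel/size names are just placeholders):

    cl_event w1, w2, w3, k1, k2, k3;

    /* transfer queue */
    clEnqueueWriteBuffer(q_mem, buf1, CL_FALSE, 0, size1, host1, 0, NULL, &w1);
    clEnqueueWriteBuffer(q_mem, buf2, CL_FALSE, 0, size2, host2, 0, NULL, &w2);
    clEnqueueWriteBuffer(q_mem, buf3, CL_FALSE, 0, size3, host3, 0, NULL, &w3);

    /* compute queue: each kernel waits on its inputs via events */
    clEnqueueNDRangeKernel(q_krn, krn1, 1, NULL, &gsz1, NULL, 1, &w1, &k1);

    cl_event dep2[] = { w2, k1 };
    clEnqueueNDRangeKernel(q_krn, krn2, 1, NULL, &gsz2, NULL, 2, dep2, &k2);

    cl_event dep3[] = { w3, k1, k2 };
    clEnqueueNDRangeKernel(q_krn, krn3, 1, NULL, &gsz3, NULL, 3, dep3, &k3);

    clFlush(q_mem);
    clFlush(q_krn);
    clWaitForEvents(1, &k3);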

CodeXL Timeline

As you can see in the timeline, the host is waiting a very long time on the GPU to finish with clWaitForEvents() while the GPU idles most of the time. You can also see the repeated operation. For convenience I also provide the list of all issued OpenCL commands.

CodeXL Commands

My questions now are:

  1. Why is the GPU idling so much? In my head I can easily push all "blue" items together and start the operation right away. The memory transfer rate is 6 GB/s, which is the expected rate.

  2. Why are the kernels executed so late? Why is there a gap between kernel #2 and kernel #3 execution?

  3. Why are memory transfers and kernels not executed in parallel? I use two command queues; with only one queue performance is even worse.

Just by pushing all commands together in my head (keeping the dependencies of course, so the 1st green must start after the 1st blue) I could triple the performance. I don't know why the GPU is so sluggish. Does anyone have some insight?


Some number crunching

  • Memory Transfer #1 is 253 µs
  • Memory Transfer #2 is 120 µs
  • Memory Transfer #3 is 143 µs -- which is consistently too high for unknown reasons; it should be about half of #2, i.e. in the range of 70-80 µs

  • Kernel #1 is 74 µs

  • Kernel #2 is 95 µs
  • Kernel #3 is 107 µs

Since Kernel #1 is faster than Memory Transfer #2 and Kernel #2 is faster than Memory Transfer #3, the overall time should be:

  • 253 µs + 120 µs + 143 µs + 107 µs = 623 µs

but the time spent in clWaitForEvents is

  • 1758 µs -- or about 3x as much

Yes, there is some overhead and I'd be fine with something like 10% (60 µs), but 300% is too much.
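
(These numbers are taken from the CodeXL trace; the same per-command durations can also be read from event profiling, assuming the queues were created with CL_QUEUE_PROFILING_ENABLE and `evt` is the finished command's event:)

    cl_ulong t_start = 0, t_end = 0;  /* device timestamps in nanoseconds */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);
    double duration_us = (double)(t_end - t_start) / 1000.0;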

Melissa P
    Try re-using the memory buffers instead of releasing and making new ones. – Jovasa Jul 13 '17 at 11:16
  • I just tried that, no change. I'm still sending memory but I keep the memory object. I have to send the memory (although unchanged) because in a real scenario later it does change. – Melissa P Jul 13 '17 at 11:23
    You need to overlap memory and IO and increase the global size. Otherwise the test is dominated by IO latencies. 100 µs for a kernel execution is very fast and is about the same as the time it takes to set up and launch the kernel. – DarkZeros Jul 13 '17 at 15:26

1 Answer


As @DarkZeros said, you need to hide the kernel-enqueue overhead by using multiple command queues so the commands overlap on the timeline.

Why is the GPU idling so much?

Because you are using two command queues and they (probably) run serially, with events that make them wait even longer.

Use a single queue if everything is serial. Let two queues overlap work only if you can add double-buffering or similar techniques to keep computation going; a sketch of that pattern follows below.
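
A rough sketch of what I mean by letting each queue do copy + compute with double-buffering (queue, buffer and size names here are placeholders, not your code):

    /* Sketch: two in-order queues, each doing copy + compute on its own buffer,
       so the driver can overlap queue A's copy with queue B's kernel. */
    for (int i = 0; i < N; ++i) {
        cl_command_queue q   = (i % 2 == 0) ? queueA : queueB;
        cl_mem           buf = (i % 2 == 0) ? bufA   : bufB;

        clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, chunkBytes,
                             (const char *)hostData + (size_t)i * chunkBytes,
                             0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf); /* args are captured at enqueue time */
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    }
    clFinish(queueA);
    clFinish(queueB);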

Why are the kernels executed so late?

The wide holes consist of host-side latencies such as enqueueing commands, flushing commands to the device, host-side algorithms, and device-side event control logic. Events may get as small as 20-30 microseconds, but host-device interactions cost more than that.

If you get rid of events and use a single queue, the driver may even apply early-compute techniques to fill those gaps before you enqueue those commands (maybe), just as CPUs do early branching (prediction).
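
For a purely serial chain, a single in-order queue without any events already expresses the ordering, roughly like this (sketch, placeholder names):

    /* Sketch: one in-order queue; ordering comes from the queue itself, no events. */
    clEnqueueWriteBuffer(queue, buf1, CL_FALSE, 0, size1, host1, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, krn1, 1, NULL, &gsz1, NULL, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf2, CL_FALSE, 0, size2, host2, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, krn2, 1, NULL, &gsz2, NULL, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf3, CL_FALSE, 0, size3, host3, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, krn3, 1, NULL, &gsz3, NULL, 0, NULL, NULL);
    clFinish(queue);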

Why are memory transfer and kernel not executed in parallel?

Nothing enforces serialization, but the driver also checks dependencies between kernels and copies, and to keep the data intact it may halt some operations until others finish.

Are you sure kernels and buffer copies are completely independent?

Another reason could be that the two queues don't have much to choose from when overlapping. If both queues contained both types of operations, they would have more options to overlap, such as kernel + kernel or copy + copy instead of just kernel + copy.


If the program has many small kernels, you may try OpenCL 2.0 dynamic parallelism (device-side enqueue), which lets the device launch kernels itself; that is faster than host-side enqueueing.
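
A minimal device-side enqueue sketch in OpenCL C 2.0 (names are illustrative; the host must also create an on-device default queue for this to work):

    /* OpenCL C 2.0 sketch: the parent kernel launches child work on the device,
       avoiding a host round-trip per launch. */
    kernel void parent(global float *data, int n)
    {
        if (get_global_id(0) == 0) {
            enqueue_kernel(get_default_queue(),
                           CLK_ENQUEUE_FLAGS_NO_WAIT,
                           ndrange_1D((size_t)n),
                           ^{ data[get_global_id(0)] *= 2.0f; });
        }
    }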

Maybe you can add a higher level of parallelism such as image-level parallelism (if it is image processing you are doing) to keep the GPU busy. Work on 5-10 images at the same time, which should ensure independent kernel/buffer executions unless all images are in the same buffer. If that doesn't work, you can launch 5-10 processes of the same program (process-level parallelism). But having too many contexts can run into driver limitations, so image-level parallelism is the better option.

The R9 390 should be able to work with 8-16 command queues.

1758 µs

Sometimes even empty kernels make the host wait 100-500 µs. You should probably enqueue 1000 cycles and wait only once at the end. If each cycle runs after a user button click, the user wouldn't notice the 1.7 ms latency anyway.
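
Something along these lines (sketch; enqueue_one_cycle is a hypothetical helper wrapping your writes + kernel launches):

    /* Sketch: enqueue many iterations back-to-back and synchronize only once,
       so the host-side wait latency is paid a single time. */
    for (int i = 0; i < 1000; ++i)
        enqueue_one_cycle(queue);   /* hypothetical helper: writes + kernels for one cycle */
    clFinish(queue);                /* wait once at the end */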


  • Use many queues.
  • Get rid of events between queues (if any).
  • Have each queue do all kinds of work.
  • Have many iterations before a single wait for event on host side.
  • If OpenCL 2.0 exists, try device-side enqueue too, but that works only for kernel executions, not for copies to/from host.
huseyin tugrul buyukisik
  • I'm beginning to understand. I only use one queue now; I added an inner loop and executed the same kernel and memory transfer hundreds of times, and the overall execution time per transfer and kernel goes down, to about 900 µs per run now, or twice as fast. Still, although independent, memory transfers and kernels never execute in parallel; it's always all memory first and then all execution. The driver reorders the commands. – Melissa P Jul 13 '17 at 16:49
  • If there are pairs of compute+copy then try this: queue-1: compute+copy repeated, queue-2: compute+copy repeated. This gives more freedom for overlapping, so even kernels can now overlap, though of course overlapping copies may not yield much. 70-90 microseconds is very overlappable across many queues IMO, so try 5-10 queues, each with both compute and copies. – huseyin tugrul buyukisik Jul 13 '17 at 17:13
  • I think that is starting to work. With 10 queues in parallel I get an average time of 500 µs per unit (transfer + execute), which is very close to the theoretical minimum of 450 µs (memory bandwidth is the bottleneck). I'm very surprised, first that the driver is not able to determine dependencies and notably adds unnecessary ones, and second that one command queue alone is not sufficient to occupy the GPU. Thanks a lot. – Melissa P Jul 13 '17 at 17:37
  • A single queue can keep the GPU busy if it does trillions of calculations. Think of it like a multi-core CPU: multiplying only a few hundred array items isn't worth it for even 2 cores. – huseyin tugrul buyukisik Jul 13 '17 at 17:51