
As we know, work items running on GPUs can diverge when there are conditional branches. This is mentioned, among other places, in Apple's OpenCL Programming Guide for Mac.

As such, some portions of an algorithm may run "single-threaded", with only one work item active. And when such a portion is especially serial and long-running, some applications move that work back to the CPU.

However, this question concerns only the GPU and assumes those portions are short-lived. Do these "single-threaded" portions also diverge (as in, execute both the true and false code paths) when they have conditional branches? Or will the compute units (or processing elements, whichever terminology you prefer) skip the untaken branches?
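
As a minimal sketch of the kind of code I mean (a hypothetical OpenCL C kernel, for illustration only; the names are mine):

    __kernel void single_item_portion(__global int *out)
    {
        if (get_global_id(0) == 0) {   /* "single-threaded" portion:    */
            if (out[0] > 0)            /* with one work item active,    */
                out[0] += 1;           /* does the untaken side of this */
            else                       /* branch still cost execution   */
                out[0] -= 1;           /* time, or is it skipped?       */
        }
    }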

Update

In reply to a comment, I've removed the OpenCL tag and left the Vulkan tag.

I included OpenCL because I wanted to know whether there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The documentation says they're equivalent, but I was skeptical.

I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.

DannyNiu
  • @NicolBolas Updated question. – DannyNiu Aug 08 '19 at 06:10
  • "*Vulkan removed the special function to enqueue a single-threaded task*" Vulkan is not an upgrade or alternate version of OpenCL. It therefore didn't "remove" something that never existed. Vulkan's scope begins and end with the GPU; therefore, *adding* a function to execute CPU tasks makes no sense. They didn't remove anything because Vulkan didn't start as a form of OpenCL. – Nicol Bolas Aug 08 '19 at 13:53
  • Even when you have multiple work items, if the compiler generated a true conditional branch for a code sequence, and all *active* work items in a thread group have the same result for the condition, only one side of the branch is executed. If you've only got one thread in the thread group, then that is always the case. So anything that the compiler generates a true branch for will only execute one side of the branch. Of course, the compiler might transform some if/else sequences into unconditional computation with a conditional move. – Jesse Hall Aug 10 '19 at 04:00

2 Answers


Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches?

From an API point of view, it has to appear to the program that only the active branch paths were taken. As for what actually happens, I suspect you'll never know for sure: GPU hardware architectures are nearly all confidential, so it's impossible to be certain.

There are really two cases here:

  • Cases where a branch in the program turns into a real branch instruction.
  • Cases where a branch in the program turns into a conditional select between two computed values.

In the case of a real branch, I would expect most implementations to execute only the active path, because executing both is a horrible waste of power and GPUs are all about energy efficiency. That said, YMMV and this isn't guaranteed at all.

For simple branches the compiler might instead choose a conditional select (compute both results, then select the right answer). In this case you will compute both results. The compiler heuristics generally aim to choose this where computing both results is less expensive than a full branch.
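
To make the two lowerings concrete, here is a sketch in C using the example from the comments below ("if (cond) x = y+1 else x = y-1"); the function names are mine, for illustration:

    int pick_branch(int cond, int y)
    {
        int x;
        if (cond)                 /* a real branch: only the taken   */
            x = y + 1;            /* side executes                   */
        else
            x = y - 1;
        return x;
    }

    int pick_select(int cond, int y)
    {
        int a = y + 1;            /* both sides are always computed, */
        int b = y - 1;            /* then a conditional select       */
        return cond ? a : b;      /* keeps the right result          */
    }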

I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.

Why would they be different? They are doing the same thing conceptually ...
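For reference, a minimal host-side sketch of the equivalence stated in the spec (queue and kernel are assumed to have been created elsewhere; clEnqueueTask is deprecated as of OpenCL 2.0):

    #include <CL/cl.h>

    /* Enqueue a single work item. By the spec, clEnqueueTask(queue,
       kernel, 0, NULL, NULL) is equivalent to the call below with
       work_dim = 1 and global/local work size of 1. */
    cl_int enqueue_single_item(cl_command_queue queue, cl_kernel kernel)
    {
        const size_t global_size = 1;
        const size_t local_size  = 1;
        return clEnqueueNDRangeKernel(queue, kernel,
                                      1,              /* work_dim           */
                                      NULL,           /* global_work_offset */
                                      &global_size,   /* global_work_size   */
                                      &local_size,    /* local_work_size    */
                                      0, NULL, NULL); /* no wait list/event */
    }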

I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.

Vulkan compute dispatch is in general a whole load simpler than OpenCL (and also perfectly adequate for most use cases), so many of the host-side functions from OpenCL have no equivalent in Vulkan. The GPU side behavior is pretty much the same. It's also worth noting that most of the holes where Vulkan shaders are missing features compared to OpenCL are being patched up with extensions - e.g. VK_KHR_shader_float16_int8 and VK_KHR_variable_pointers.
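
For comparison, Vulkan has just the one dispatch path; a sketch, assuming commandBuffer is in the recording state with a compute pipeline bound and the shader declares a local size of 1:

    #include <vulkan/vulkan.h>

    void dispatch_single_item(VkCommandBuffer commandBuffer)
    {
        /* One 1x1x1 workgroup; with local_size 1 in the shader this
           runs exactly one invocation - the closest analogue of a
           single work item. */
        vkCmdDispatch(commandBuffer, 1, 1, 1);
    }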

solidpixel
  • Overall I agree, but just to clarify: if both branches are cheap to compute, doing both and doing a conditional select is often going to be more efficient than branching, even in single-threaded cases, so I'd expect most architectures to do that (though as you say, no guarantees). I'm thinking of cases like "if (cond) x = y+1 else x = y-1;". If you start having several instructions on one or both sides, or expensive instructions like memory reads, then they'll probably do a true branch. – Jesse Hall Aug 10 '19 at 03:56

Q : Or will the compute units skip those false branches?

The ecosystem of CPU / GPU code-execution is rather complex.

The hardware layer is where the code paths (translated into "machine" code) execute. At this layer, the SIMD compute units cannot and will not skip anything they are ordered to SIMD-process by the hardware scheduler (the next layer up).

The hardware-specific scheduler layer: GPUs typically have just two modes, a warp-mode scheduling for coherent, non-diverging code paths that can be efficiently scheduled in SIMD blocks, and a greedy-mode scheduling. From this layer, the SIMD compute units are loaded with SIMD-operated blocks of work, so any divergence detected at the layer below interrupts execution and flags the affected blocks to the SIMD hardware scheduler for deferred execution; the SIMD-optimised block scheduling is well known to grow less and less efficient with each such run-time divergence.

The { OpenCL | Vulkan API }-mediated, device-specific programming layer decides a lot about the ease and comfort of programming the wide range of target devices, all without the programmer knowing a device's internal constraints, the (compiler-decided) preferred "machine"-code reformulation of the computing problem, or the device-specific tricks and scheduling. In a somewhat oversimplified picture, users have for years stayed "in front" of the mediated, asynchronous HOST-to-DEVICE scheduling queues of work units (kernels) and waited until the DEVICE-to-HOST results were delivered back, performing the prior H2D / posterior D2H memory transfers where allowed and needed.

The HOST-side "scheduling" directives for DEVICE kernel code are rather imperative and help the mediated, device-specific programming reflect user-side preferences, yet they leave the user blind to all the internal decisions (assembly-level reviews are really only for hard-core, device-specific GPU engineering aces, and are hard to act on even when willing).

All that said, "adaptive", run-time-value-based decisions to move a particular "work unit" back to the HOST CPU, rather than finishing it on the DEVICE GPU, do not, to the best of my knowledge, take place at the bottom of this complex computing-ecosystem hierarchy (AFAIK, it would be prohibitively expensive to try).

user3666197
  • Not sure what device-host interactions have to do with this. The question was about GPU-side control flow divergence. – solidpixel Aug 08 '19 at 07:51
  • Not exactly, @solidpixel: the OP formulated another point of view, wish, or expectation - Ref.: "... some portions of an algorithm may run 'single-threaded', with only one work item active. And when such a portion is especially serial and long-running, some **applications move that work back to the CPU**." - which needed a somewhat more detailed discourse into the many-layered problem, showing the principal limits of compile-time vs. run-time interactions. **Run-time changes of decisions about on-CPU vs. on-GPU scheduling would be awfully expensive and inefficient, both latency-wise and Amdahl's-Law-wise.** – user3666197 Aug 08 '19 at 08:26
  • @user3666197: But here's the thing: running work on the CPU is not a thing that happens on Vulkan. The question is talking about a thing that doesn't exist. And I know that when you answered it, it had OpenCL in the tags. But that has since been clarified by the OP and removed. – Nicol Bolas Aug 08 '19 at 13:51
  • There is **no need to convince me** about the known facts. Yet explaining a complex problem in a comprehensible manner, so as to reach advanced users and practitioners as well as the not-so-experienced, is not easy. Building some common ground for understanding matters more at the start of each individual's learning curve. Explaining the limits of the PTX metacode or the SPIR-V Intermediate Representation formats, and their device-specific adaptations for on-chip scheduling tricks, would definitely be more exact, yet at the cost of being (almost) incomprehensible to the wider public. – user3666197 Aug 08 '19 at 14:04
  • @NicolBolas, with all your professional experience of GPU-based computing, let me repeat the OP's words here, with DannyNiu's **focus expressed on "serial and long-running" computing kernels**, which is a common form of GPU-side computing support for generating and then operating a device-side numerical model that is further used by the host-side application logic (still running on the CPU). **This is different from typical {2D|3D} graphics acceleration** (which is clearly a one-way pipelined flow with no need for interactions to propagate back from device to host). – user3666197 Aug 08 '19 at 14:24
  • I've gone through quite a similar problem and worries as the OP's use case shows, where a pair of differently specialised, smartly crafted, infinitely running device-side kernels operated the GPU side of a distributed computation (a low-latency AI neural-network kernel "communicating" with the other GPU kernel, and that other kernel "communicating" with the outside world, where the CPU application logic was feeding data and orchestrating the external interactions of this dual-"embedded" inner system), so kindly accept that there are cases for very strange kernel-side SIMD-threading setups with GPU/CPU interactions. – user3666197 Aug 08 '19 at 14:36
  • @user3666197: "*so kindly accept there are cases for very strange kernel-side SIMD-threading setups with GPU/CPU interactions*" I never claimed that there aren't use cases for it. I said that Vulkan *doesn't support them*. And since the question is about Vulkan specifically, such setups are irrelevant to what is being asked, even if the OP thinks they are relevant. It's no different than asking about screws when saying that you're using a hammer. Wrong tool for the wrong job. – Nicol Bolas Aug 08 '19 at 14:38
  • Fully agree with Maslow's Hammer [ btw. wrong ears here to shout at :o) ] – user3666197 Aug 08 '19 at 14:40