2

Let's say I have an OpenGL compute shader with local_size=8*8*8. How do the invocations map to NVIDIA GPU warps? Would invocations with the same gl_LocalInvocationID.x be in the same warp? Or the same .y? Or .z? I don't mean all of them in one warp; I'm asking about the general aggregation, i.e. along which dimension invocations are grouped first.

I am asking because of an optimization: at one point, not all invocations have work to do, and I want the ones that do to end up in the same warp.
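
For reference, the shader's declaration looks roughly like this (buffer bindings and the actual work are omitted, and the names are just placeholders):

    #version 430

    // 8*8*8 = 512 invocations per work group
    layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

    void main() {
        // Which component of gl_LocalInvocationID (if any) tends to vary
        // within a single warp, and which stays constant?
        uvec3 id = gl_LocalInvocationID;
        // ... per-invocation work ...
    }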

Danol

2 Answers

5

The compute shader execution model allows the number of invocations to (greatly) exceed the number of individual execution units in a warp/wavefront. For example, hardware warp/wavefront sizes tend to be between 16 and 64, while the number of invocations within a work group (GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS) is required in OpenGL to be no less than 1024.

When a work group spans multiple warps/wavefronts, barrier calls and shared-variable use work essentially by halting the progress of each warp/wavefront until all of them have passed that particular point, and then performing various memory flushing so that they can access each other's variables (based on memory barrier usage, of course). If all of the invocations in a work group fit into a single warp, then it's possible to avoid such things.
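
As a minimal sketch of that pattern (the array size and names here are arbitrary, not taken from the question):

    #version 430
    layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

    // Visible to every invocation in the work group, regardless of which
    // warp/wavefront each invocation landed in.
    shared float partials[512];

    void main() {
        uint idx = gl_LocalInvocationIndex;
        partials[idx] = float(idx);   // each invocation writes its own slot

        // No invocation proceeds past this point until every invocation in
        // the work group has reached it; in compute shaders, barrier() also
        // makes prior writes to shared variables visible.
        barrier();

        // Now it is safe to read a slot written by another invocation,
        // even one that may live in a different warp/wavefront.
        float neighbor = partials[(idx + 1u) % 512u];
        // ... use neighbor ...
    }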

Basically, you have no control over how CS invocations are grouped into warps. You can assume that the implementation is not trying to be slow (that is, it will generally group invocations from the same work group into the same warp), but you cannot assume that all invocations within the same work group will be in the same warp.

Nor should you assume that each warp only executes invocations from the same work group.

Nicol Bolas
  • I see I've asked the question wrong. I didn't mean all invocations, I was just asking about general aggregation - for which dimension are the invocations aggregated first. – Danol Dec 08 '18 at 21:09
  • @Danol: I'm curious as to what you're planning to do with that information. Even if you restrict yourself to a *specific* GPU with a known set of drivers, how would you write your code differently? If you don't want inter-warp cross-talk, then keep your work group size the same as the warp size. – Nicol Bolas Dec 08 '18 at 21:29
  • Let's say I have a task that only needs 25% of the work group's invocations running. I wanted to have the running portion packed together in the same warps, so the warps with only inactive invocations could be skipped. I then found gl_LocalInvocationIndex, which, assuming it correlates with the warp distribution, should solve the issue. – Danol Dec 08 '18 at 22:10
  • @Danol: "*Let's say I have a task that only needs 25% of the work group invocations running.*" Stop: why do you have that? Why would you dispatch more work than you actually want to do? Wouldn't it be better to simply adjust your dispatch so that you're only dispatching what you want done? – Nicol Bolas Dec 08 '18 at 23:38
  • Let's say I have n time-intensive computations whose results I use in all the work group's invocations. Those computations are algorithmically the same; they just differ in parameters. To optimize things, I run the computations once in parallel and store the results in a shared variable. If the work group size m > n, I am not using the entire work group but only a portion of the threads (roughly the pattern sketched below, after these comments). – Danol Dec 09 '18 at 16:13
  • @Danol: Sure, sometimes the number of invocations will exceed the actual amount of work to be done. But you said "*only needs 25% of the work group invocations running*" That means that `m` is **four times greater** than `n`. Unless the amount of work you're doing is exceedingly tiny (like 4 threads), this is basically something that should never happen. And in the case of a small amount of work, you would be dispatching just one workgroup, and that group size should probably be the warp size. So you would just be executing a single warp. – Nicol Bolas Dec 09 '18 at 16:20
  • I disagree that it should never happen. Imo it can happen; it happened to me. About the workgroup: yes that was the point of my question. I was wondering which dimension of gl_LocalInvocationID I should use so the working invocations would be in the same warp (I didn't know of gl_LocalInvocationIndex yet). – Danol Dec 09 '18 at 16:31
  • @Danol: "*Imo it can happen; it happened to me.*" But I don't understand *why* it happened to you. That is, if it happens, it seems to me that the source is ultimately pathological/incorrect use of the API, not a failure in the API itself. Why is your workgroup size so large, yet you have so few invocations to process? Wouldn't it make more sense to make your workgroup size the warp size? And if you can't do that because all threads in a WG need to inter-communicate, then you still need that intercommunication even if those threads aren't doing useful work, yes? – Nicol Bolas Dec 09 '18 at 17:31
  • @Danol: The example you provide, "*Those computations are algorithmically the same, they just differ in parameters*", is an example of this issue. If the threads don't need to communicate, then the work group size is irrelevant to how they process their data. This means the work group size can be *whatever you want*. Therefore, it should be the size of the warp. And therefore, all of your dispatch calls should execute some number of warps. – Nicol Bolas Dec 09 '18 at 17:47
  • It is a precomputation phase. In the code that follows, I use the results of these computations in parallelized code that uses all the invocations fully. – Danol Dec 09 '18 at 18:53
  • 1
    @Danol - Are you by any chance trying to do that precomputation phase and the main phase of the task in one go? If that's not the case, I don't know why you would want to know this information. I think your precomputation phase is very small compared to the main task. You should either process it on the CPU and pass it to the GPU, or use a separate invocation of the shader altogether with a suitable workgroup size for that precomputation. – gallickgunner Dec 19 '18 at 07:48
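
To make the pattern described in the comments above concrete, here is a rough sketch of such a two-phase shader; N, the shared array, and expensiveComputation() are hypothetical placeholders rather than anything from the actual code under discussion:

    #version 430
    layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

    const uint N = 128u;            // hypothetical: 25% of the 512 invocations
    shared float precomputed[N];    // results reused by the whole work group

    // Placeholder for the expensive, parameter-dependent computation.
    float expensiveComputation(uint param) {
        return float(param) * 0.5;
    }

    void main() {
        uint idx = gl_LocalInvocationIndex;

        // Phase 1: only the first N invocations do the precomputation;
        // the remaining invocations (and, ideally, whole warps) sit idle.
        if (idx < N) {
            precomputed[idx] = expensiveComputation(idx);
        }

        // Wait for phase 1 to finish and make its shared writes visible.
        barrier();

        // Phase 2: every invocation in the work group uses the shared results.
        float value = precomputed[idx % N];
        // ... fully parallel main work using value ...
    }

Whether the idle warps can actually be skipped depends entirely on how the implementation packs invocations into warps, which is exactly what the answer above says you cannot rely on.
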
-5

According to this: https://www.khronos.org/opengl/wiki/Compute_Shader#Inputs

    gl_LocalInvocationIndex =
        gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
        gl_LocalInvocationID.y * gl_WorkGroupSize.x +
        gl_LocalInvocationID.x;

So, assuming the implementation fills warps in gl_LocalInvocationIndex order (which, as the other answer points out, is not guaranteed), invocations that differ only in gl_LocalInvocationID.x, i.e. that share the same .y and .z, should end up in the same warp, because .x is the fastest-varying part of the index.
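
Spelling out the arithmetic for this case, under the assumptions of an NVIDIA warp size of 32 and index-order packing (neither of which is guaranteed):

    #version 430
    layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

    void main() {
        // For this local size: gl_LocalInvocationIndex = z*64 + y*8 + x.
        uint presumedWarp = gl_LocalInvocationIndex / 32u;
        //  = (z*64 + y*8 + x) / 32
        //  = 2*z + y/4            (integer division, since y*8 + x < 64)
        // So one presumed warp covers all 8 x values and 4 consecutive
        // y values at a fixed z.
    }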

Danol
  • On NVIDIA (CUDA platform), warps have a size of 32 threads, so assuming the `gl_LocalInvocationID` corresponds to a CUDA thread (or OpenCL work item), invocations with the same `floor(gl_LocalInvocationIndex / 32)` should be in the same warp. – tmlen Dec 08 '18 at 18:18
  • @tmlen Right, and because of that, invocations with consecutive gl_LocalInvocationIndex values (so, the same .y and .z but different .x) should be in the same warp. – Danol Dec 08 '18 at 18:22