
I'm rendering single-pixel points into a uint32 texture with a compute shader. The texture is a 3D texture: x and y are viewport coordinates, and z holds depth information at coordinate 0 and additional attributes at coordinate 1. So, two manually built render targets, if you will. The code looks like this:

```glsl
layout (r32ui, binding = 0) coherent volatile uniform uimage3D renderBuffer;
layout (rgba32f, binding = 1) restrict readonly uniform imageBuffer pointBuffer;

for (int j = 0; j < numPoints / gl_WorkGroupSize.x + 1; j++)
{
    vec4 point = imageLoad(pointBuffer, ...);
    // ... transform point ...

    // Depth test: atomically keep the smaller depth and get the previous value back.
    // (point.depth and point.attributes are shorthand for the packed uint values.)
    uint originalDepth = imageAtomicMin(renderBuffer, ivec3(imageCoords, 0), point.depth);
    if (originalDepth >= point.depth)
    {
        // write happened, store the attributes
        imageStore(renderBuffer, ivec3(imageCoords, 1), point.attributes);
    }
}
```

While the depth values are correct, I have a few pixels where the attributes flicker between two values.

The order of points in the pointBuffer is random (but I've verified that the set of all points is always the same), so my first thought was that two equal depth values might change the output depending on which one comes first. So I changed it to use imageAtomicMax on the attributes when originalDepth == point.depth, so that the same one of the two candidate attributes always wins. That changed nothing.

I scattered barrier() and memoryBarrier() calls all over the place, but that changed nothing. I also removed all diverging control flow around this code; again, no change.

Reducing the local work size to 32 removes about 90% of the flickering, but some still remains.

Any ideas would be greatly appreciated.

Edit: before you ask why I do this manually instead of using normal rasterization and fragment shaders: the reason is performance. The rasterizer doesn't help since I'm rendering single-pixel points, shared memory greatly sped things up, and I render each point multiple times, which would have required a geometry shader, which was slow.

karyon

1 Answer


The problem is this: you have a race condition on writing to renderBuffer. If two different CS invocations map to the same pixel, and both of them pass the depth test, then there is a race on your imageStore call: the imageAtomicMin and the subsequent imageStore are not a single atomic operation. One store may overwrite the other, it may be a partial overwrite, or something else entirely. In any case, it's not guaranteed to work.
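To make that concrete, here is one hypothetical interleaving (invocation names and depth values are invented for illustration):

```glsl
// Invocations A (depth 7) and B (depth 5) hit the same pixel, stored depth 10.
// The atomics themselves behave correctly:
//   A: imageAtomicMin(depth, 7) returns 10  -> 10 >= 7, A passes the test
//   B: imageAtomicMin(depth, 5) returns 7   ->  7 >= 5, B passes the test
//   B: imageStore(B's attributes)           // the correct winner writes first...
//   A: imageStore(A's attributes)           // ...then A clobbers it with stale data
// The depth layer ends up correct (5), but the attribute layer holds A's data.
```

Which store lands last depends on scheduling, which is exactly the flicker you're seeing.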

This would be best solved by doing what rasterizers do: break the process down into two separate phases. The first phase does the ... transform point ... part, writing that data out to a buffer. The second phase then goes through the points and writes them to the final image.
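As a rough illustration, phase 1 might look like this (TransformedPoint, nTransformed, and all bindings here are invented for the sketch, not taken from the question):

```glsl
#version 430
layout (local_size_x = 256) in;

layout (rgba32f, binding = 1) restrict readonly uniform imageBuffer pointBuffer;

struct TransformedPoint {
    uvec2 pixel;       // target pixel in the viewport
    uint  depth;       // depth converted to a sortable uint
    uint  attributes;  // packed attributes
};

layout (std430, binding = 2) buffer TransformedPoints {
    uint nTransformed;               // atomic append counter
    TransformedPoint points[];
};

uniform int numPoints;

void main()
{
    int i = int(gl_GlobalInvocationID.x);
    if (i >= numPoints)
        return;

    vec4 p = imageLoad(pointBuffer, i);
    uvec2 pixel = uvec2(0u);             // ... project p to viewport coordinates ...
    uint  depth = floatBitsToUint(p.z);  // order-preserving for non-negative floats
    uint  attrs = 0u;                    // ... pack the point's attributes ...

    // atomicAdd hands each point its own output slot, so there is no
    // write-write race here; only the counter is contended.
    uint slot = atomicAdd(nTransformed, 1u);
    points[slot] = TransformedPoint(pixel, depth, attrs);
}
```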

In phase 2, each CS invocation performs all of the processing for a particular output pixel. That way, there are no race conditions. Of course, that requires that phase 1 produces data in a way that can be ordered per-pixel.
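For instance, assuming phase 1 additionally threaded its output into one linked list per pixel (one way to build such lists is sketched after the next paragraph), phase 2 could look roughly like this; again, every name and binding is an assumption:

```glsl
#version 430
layout (local_size_x = 8, local_size_y = 8) in;

layout (r32ui, binding = 0) writeonly uniform uimage3D renderBuffer;
layout (r32ui, binding = 3) readonly uniform uimage2D headPointers;

struct Node { uint depth; uint attributes; uint next; };
layout (std430, binding = 4) readonly buffer Nodes {
    uint nNodes;
    Node nodes[];
};

void main()
{
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);

    // Walk this pixel's list and keep the closest point.
    uint bestDepth = 0xFFFFFFFFu;
    uint bestAttrs = 0u;
    for (uint n = imageLoad(headPointers, pixel).x;
         n != 0xFFFFFFFFu; n = nodes[n].next)
    {
        if (nodes[n].depth < bestDepth) {
            bestDepth = nodes[n].depth;
            bestAttrs = nodes[n].attributes;
        }
    }

    // This invocation is the only writer for its pixel, so depth and
    // attributes cannot get out of sync.
    imageStore(renderBuffer, ivec3(pixel, 0), uvec4(bestDepth));
    imageStore(renderBuffer, ivec3(pixel, 1), uvec4(bestAttrs));
}
```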

There are several ways to go about the latter. You could use a linked list, with one list per pixel. Or you could use a list per workgroup, where a workgroup represents some X/Y region of pixel space. In that case, you would use local shared memory as your local depth buffer, with all CS invocations reading from and writing to that region. After they have all finished processing, you write the tile out to real memory. Basically, you'd be implementing tile-based rendering manually.
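The per-pixel linked lists can be built in phase 1 with a head-pointer image and an atomically allocated node buffer, much like classic order-independent-transparency implementations. A sketch, using the same invented names as above:

```glsl
// headPointers must be cleared to 0xFFFFFFFFu (the list terminator) beforehand.
layout (r32ui, binding = 3) coherent uniform uimage2D headPointers;

struct Node { uint depth; uint attributes; uint next; };
layout (std430, binding = 4) buffer Nodes {
    uint nNodes;       // atomic allocation counter
    Node nodes[];
};

void appendToPixelList(ivec2 pixel, uint depth, uint attrs)
{
    uint node = atomicAdd(nNodes, 1u);                        // grab a fresh node
    uint prevHead = imageAtomicExchange(headPointers, pixel, node);
    nodes[node] = Node(depth, attrs, prevHead);               // link it in front
}
```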

Indeed, if you have a lot of these points, a tile-based solution would allow you to incorporate pipelining, so that you don't have to wait until all of phase 1 is done before starting some of phase 2. You could break phase 1 down into chunks: kick off a couple of phase 1 chunks, then a phase 2 chunk that reads from the first phase 1 chunk, then another phase 1 chunk, and so forth.

Vulkan, with its event system, has better tools than OpenGL for building such an efficient dependency chain.

Nicol Bolas
  • I myself wrote in a damn comment directly above that imageStore stating "potential race condition here" but for some reason decided it couldn't be responsible. Probably got too focused there... Anyway, thanks :) I'm rendering into an atlas of 64x64 shadow maps and already have one work group per shadow map, but that's too large to keep the depth buffer in shared memory. Two questions: First, doesn't the per-workgroup list just move the problem from global to shared memory, so that I'd need to solve the race condition there as well? Second, wouldn't the pipelining require several dispatch calls? – karyon Aug 02 '16 at 20:04
  • "*Doesn't the per-workgroup list not just move the problem from global to shared memory and i'd need to solve the race condition there as well?*" Yes, but there are things you can do to solve it there, which are relatively cheap. – Nicol Bolas Aug 02 '16 at 20:38
  • A pointer on where to start would be nice, since I'm at the end of my knowledge here and this stuff is hardly google-able... – karyon Aug 02 '16 at 20:54
  • My fix was to read back the written value after the `imageAtomicMin` (and an added `groupMemoryBarrier`) and check whether it actually is the one I wrote. This works as long as no other workgroup accesses the same pixels. When working in shared memory, one can simply write into shared memory in a loop until the write succeeds (see the sketch below); see "High-Performance Software Rasterization on GPUs" by Laine and Karras. – karyon Aug 03 '16 at 08:14
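For later readers, here is a rough sketch of the loop that last comment describes, in the shared-memory tile setting (tile size and all names are assumptions; see the Laine/Karras paper for the full treatment, including subtleties such as equal depths that this glosses over):

```glsl
shared uint tileDepth[8][8];   // per-tile depth buffer, initialized to 0xFFFFFFFFu
shared uint tileAttrs[8][8];   // per-tile attribute buffer

void depthTestAndWrite(ivec2 p, uint depth, uint attrs)
{
    for (;;) {
        uint prev = atomicMin(tileDepth[p.y][p.x], depth);
        if (prev < depth)
            return;                      // a closer point already won; give up
        tileAttrs[p.y][p.x] = attrs;     // we won the depth test; store attributes
        groupMemoryBarrier();            // make the stores visible to the workgroup
        if (tileAttrs[p.y][p.x] == attrs)
            return;                      // our write survived intact
        // Otherwise a losing invocation's store landed on top of ours;
        // loop and write again until our value sticks.
    }
}
```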