3

I am trying to implement a spinlock in GLSL for use in voxel cone tracing. The lock state lives in a separate 3D texture that supports atomic operations. To avoid wasting memory, I store each lock in a single bit rather than a full integer. The problem is that the loop never terminates unless I cap the number of iterations. I implemented exactly the same mechanism in C#, ran many tasks against shared resources, and there it works perfectly. The book Euro-Par 2017: Parallel Processing (page 274, can be found on Google) mentions possible caveats when using locks on SIMT devices. I think my code avoids those caveats.

Problematic GLSL Code:

void imageAtomicRGBA8Avg(layout(RGBA8) volatile image3D image, layout(r32ui) volatile uimage3D lockImage,
    ivec3 coords, vec4 value)
{
    ivec3 lockCoords = coords;

    uint bit = 1<<(lockCoords.z & (4)); //1<<(coord.z % 32)  
    lockCoords.z = lockCoords.z >> 5;  //Division by 32    

    uint oldValue = 0;
    //int counter=0;
    bool goOn = true;
    while (goOn /*&& counter < 10000*/)
    //while(true)
    {
        uint newValue = oldValue | bit;
        uint result = imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);

        //Writing is allowed if could write our value and if the bit indicating the lock is not already set
        if (result == oldValue && (result & bit) == 0) 
        {
            vec4 rval = imageLoad(image, coords);
            rval.rgb = (rval.rgb * rval.a); // Denormalize
            vec4 curValF = rval + value;    // Add
            curValF.rgb /= curValF.a;       // Renormalize   
            imageStore(image, coords, curValF);

            //Release the lock and set the flag such that the loops terminate
            bit = ~bit;
            oldValue = 0;
            while (goOn)
            {
                newValue = oldValue & bit;
                result = imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
                if (result == oldValue) 
                    goOn = false; //break;
                oldValue = result;
            }
            //break;
        }
        oldValue = result;
        //++counter;
    }
}
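
The packing described in the question (one lock bit per voxel, 32 voxels per 32-bit word along z) can be checked on the CPU. Below is a minimal C++ sketch of that mapping; `LockSlot` and `lockSlotForZ` are hypothetical names, not part of the posted shader. Note that the posted shader masks with `4`, while its own comment states `coord.z % 32`; the sketch follows the comment and masks with `31`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: where the lock bit for one voxel lives.
struct LockSlot {
    int wordZ;          // z coordinate inside the lock texture (32 voxels per word)
    std::uint32_t mask; // single bit selecting this voxel inside the word
};

// Assumes z >= 0. Follows the intent stated in the shader comments:
// bit = 1 << (z % 32), wordZ = z / 32.
inline LockSlot lockSlotForZ(int z) {
    assert(z >= 0);
    return LockSlot{ z >> 5, std::uint32_t{1} << (z & 31) };
}
```

With this mapping, voxels z = 0..31 share lock word 0, z = 32..63 share lock word 1, and so on, each with its own bit.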

Working C# code with identical functionality

public static void Test()
{
    int buffer = 0;
    int[] resource = new int[2];
    Action testA = delegate ()
    {
        for (int i = 0; i < 100000; ++i)
            imageAtomicRGBA8Avg(ref buffer, 1, resource);
    };
    Action testB = delegate ()
    {
        for (int i = 0; i < 100000; ++i)
            imageAtomicRGBA8Avg(ref buffer, 2, resource);
    };

    Task[] tA = new Task[100];
    Task[] tB = new Task[100];
    for (int i = 0; i < tA.Length; ++i)
    {
        tA[i] = new Task(testA);
        tA[i].Start();
        tB[i] = new Task(testB);
        tB[i].Start();
    }

    for (int i = 0; i < tA.Length; ++i)
        tA[i].Wait();
    for (int i = 0; i < tB.Length; ++i)
        tB[i].Wait();
}

public static void imageAtomicRGBA8Avg(ref int lockImage, int bit, int[] resource)
{
    int oldValue = 0;
    int counter = 0;
    bool goOn = true;
    while (goOn /*&& counter < 10000*/)
    {
        int newValue = oldValue | bit;
        int result = Interlocked.CompareExchange(ref lockImage, newValue, oldValue); //imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
        if (result == oldValue && (result & bit) == 0)
        {
            //Now we hold the lock and can write safely
            resource[bit - 1]++;

            bit = ~bit;
            oldValue = 0;
            while (goOn)
            {
                newValue = oldValue & bit;
                result = Interlocked.CompareExchange(ref lockImage, newValue, oldValue); //imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
                if (result == oldValue)
                    goOn = false; //break;
                oldValue = result;
            }
            //break;
        }
        oldValue = result;
        ++counter;
    }
}

The locking mechanism should work almost identically to the one described in OpenGL Insights, Chapter 22, "Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer" by Cyril Crassin and Simon Green. They use integer textures to store the color of every voxel, which I would like to avoid because it complicates mipmapping and other things. I hope the post is understandable; I get the feeling it is already becoming too long...

Why does the GLSL implementation not terminate?

Max Young
noname2

2 Answers

1

If I understand you correctly, you use lockImage as a thread lock: a given value at given coordinates means "only this shader instance may perform the next operations" (changing data in the other image at those coordinates). Right.

The key is imageAtomicCompSwap. We know it did its job when it managed to store that value (let's say 0 means "free" and 1 means "locked"). We can tell because the returned value (the original value) is "free", i.e. the swap operation happened:

bool goOn = true;
uint oldValue = 0; //free
uint newValue = 1; //locked
//Wait for other shader instance to free the simulated lock
while ( goOn )
{
    uint result = imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
    if ( result == oldValue ) //it was free, now it's locked
    {
        //Just this shader instance executes next lines now.
        //Other instances will find a "locked" value in 'lockImage' and will wait
        ...
        //release our simulated lock
        imageAtomicCompSwap(lockImage, lockCoords, newValue, oldValue);
        goOn = false;
    }
}
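
The same free/locked protocol can be tried out on the CPU. This is only a sketch under assumptions: C++ `std::atomic` stands in for the lock texel, `compare_exchange_strong` stands in for imageAtomicCompSwap, and the names `tryLock`, `unlock` and `withLock` are hypothetical. It does not model GPU memory qualifiers or SIMT scheduling.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kFree = 0;   // same convention as above: 0 means "free"
constexpr std::uint32_t kLocked = 1; // 1 means "locked"

// Returns true if this caller changed the word from free to locked.
inline bool tryLock(std::atomic<std::uint32_t>& lockWord) {
    std::uint32_t expected = kFree;
    return lockWord.compare_exchange_strong(expected, kLocked,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed);
}

// Must only be called by the current holder of the lock.
inline void unlock(std::atomic<std::uint32_t>& lockWord) {
    assert(lockWord.load() == kLocked); // debug check: the lock is held
    lockWord.store(kFree, std::memory_order_release);
}

// Spins until the lock is taken, runs the critical section, then releases.
template <typename Fn>
void withLock(std::atomic<std::uint32_t>& lockWord, Fn criticalSection) {
    while (!tryLock(lockWord)) { /* spin */ }
    criticalSection();
    unlock(lockWord);
}
```

A caller would wrap its update of the shared data in `withLock(word, [&]{ ... });`, which corresponds to the body of the `if` in the loop above.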

I think your code loops forever because you complicated things with the bit variable and misused oldValue and newValue.

EDIT:

If the 'z' of the lockImage is a multiple of 32 (just a hint for understanding; it need not be an exact multiple), you are trying to pack 32 voxel locks into one integer. Let's call this integer 32C.

A shader instance ("SI") may want to change its bit in 32C, to lock or unlock. So you must (A) get the current value and (B) change only your bit.

Other SIs are trying to change their bits: some the same bit, others different bits.

Between two calls to imageAtomicCompSwap in one SI, another SI may have changed not your bit (it's locked, isn't it?) but other bits in the same 32C value. You don't know the current value; you only know your bit. Thus you have nothing (or a stale value) to compare against in the imageAtomicCompSwap call, so it likely fails to set the new value. Several SIs failing this way leads to "deadlocks", and the while loop never ends.

You try to avoid using a stale value by setting oldValue = result and calling imageAtomicCompSwap again. This is the (A)-(B) sequence I described above. But between (A) and (B) another SI may still have changed the 32C value held in result, ruining your idea.
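
The (A)-(B) retry on a packed word can be written down as a CPU-side sketch, assuming C++ `std::atomic` in place of the lock texel and hypothetical names `tryAcquireBit` and `releaseBit`. A failed compare-and-swap refreshes the observed value, so the loop retries only when other bits of the word changed in between. Whether the equivalent shader loop makes progress on a GPU also depends on memory qualifiers and SIMT scheduling, which this sketch does not model.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Tries to set `mask` in `word`; returns true if this caller set it.
// Returns false as soon as the bit is seen to be held by someone else.
inline bool tryAcquireBit(std::atomic<std::uint32_t>& word, std::uint32_t mask) {
    assert(mask != 0 && (mask & (mask - 1)) == 0); // exactly one bit set
    std::uint32_t seen = word.load(std::memory_order_relaxed); // step (A)
    while ((seen & mask) == 0) {
        // Step (B): on failure, compare_exchange_weak stores the current
        // value of `word` into `seen`, so the next attempt is not stale.
        if (word.compare_exchange_weak(seen, seen | mask,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
            return true;
        }
    }
    return false;
}

// Clears `mask` in `word` while leaving all other bits untouched.
inline void releaseBit(std::atomic<std::uint32_t>& word, std::uint32_t mask) {
    std::uint32_t seen = word.load(std::memory_order_relaxed);
    while (!word.compare_exchange_weak(seen, seen & ~mask,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // `seen` was refreshed by the failed exchange; retry.
    }
}
```

In this model, a caller spins on `tryAcquireBit` until it returns true, does its work, and then calls `releaseBit`; locks held on other bits of the same word are preserved throughout.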

IDEA: You can use my simple approach (just 0 or 1 values in lockImage), without the bit packing. The result is that lockImage is smaller, but every shader instance trying to update any of the 32 image coords covered by one 32C value in lockImage will wait until the instance that locked that value frees it.

Using another lockImage2 just to lock and unlock the 32C value for a bit update seems like too much spinning.

Ripi2
  • Thanks for the answer. With the code you suggested I would only be able to store the information of one lock per 32bit integer instead of 32 when using all bits separately. The 3D texture would take 32 times more memory. – noname2 Oct 08 '17 at 19:06
  • @noname2 I've updated my answer with your "pack" approach. – Ripi2 Oct 09 '17 at 16:42
  • I tested a bit more myself and it turned out that I simply forgot the coherent keyword; volatile is not enough. It works now. The flickering that remains is due to 8-bit overflow when computing the running average. I would upvote your answer to appreciate your effort but I don't have enough reputation points :) – noname2 Oct 09 '17 at 19:44
  • @noname2 Post the solution you found as an answer yourself, and accept it. Other users may find it useful in the future. Also, there's a mention of the *coherent* keyword in the [wiki OpenGL Image load store](https://www.khronos.org/opengl/wiki/Image_Load_Store). Add a link to it in your answer. – Ripi2 Oct 09 '17 at 20:10
  • Anyhow, now I'm worried about why you said "it works now", because that means that my logic fails ;) – Ripi2 Oct 09 '17 at 20:13
-1

I have written an article about how to implement a per-pixel mutex in a fragment shader, along with code. I think you can refer to it; you are doing something quite similar to what I explain there. Here we go:

Getting Over Draw Count and Per Pixel Mutex

What is overdraw count?

On embedded hardware in particular, a major cause of performance drops can be overdraw. Overdraw means that one pixel on screen is shaded multiple times by the GPU because of the geometry or scene being drawn. There are many tools to visualize overdraw count.

Details about overdraw?

When we draw vertices, they are transformed to clip space and then to window coordinates. The rasterizer then maps these coordinates to pixels/fragments, and the GPU invokes the pixel shader for each of them. There can be cases where we draw multiple instances of geometry and blend them, so the same pixel is drawn multiple times. This is overdraw, and it can degrade performance.

Strategies to avoid overdraw?

  1. Consider frustum culling - Do frustum culling on the CPU so that objects outside the camera's field of view are not rendered.

  2. Sort objects by z - Draw objects from front to back; this way the z test fails for later objects and their fragments are not written.

  3. Enable back-face culling - This avoids rendering back faces, which face away from the camera.

Note that for blending we render in exactly the reverse order of point 2: from back to front. We need to do this because blending happens after the z test. If a fragment fails the z test, it is discarded completely even though, with blending on, it should still contribute, and this produces artifacts. Hence we must keep back-to-front order, and so with blending enabled we get a higher overdraw count.

Why do we need a per-pixel mutex?

The GPU is parallel by nature, so pixels can be shaded in parallel and many instances of the pixel shader run at the same time. These instances may be shading the same pixel and hence accessing the same memory, which can lead to synchronization issues and unwanted effects. In this application I maintain the overdraw count in an image buffer initialized to 0. The operations I perform are, in order:

  1. Read the count of pixel i from the image buffer (zero the first time)
  2. Add 1 to the counter value read in step 1
  3. Store the new counter value at position i in the image buffer

As I said, multiple instances of the pixel shader could be working on the same pixel, and this may corrupt the counter because the steps above are not atomic. I could have used the built-in function imageAtomicAdd(), but I wanted to show how to implement a per-pixel mutex, so I have not used it.

 #version 430

 layout(binding = 0,r32ui) uniform uimage2D overdraw_count;
 layout(binding = 1,r32ui) uniform uimage2D image_lock;

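 void mutex_lock(ivec2 pos) {
     uint previous;
     do {
          // imageAtomicCompSwap returns the value stored before the call:
          // 0 means the lock was free and this invocation now holds it.
          previous = imageAtomicCompSwap(image_lock, pos, 0u, 1u);
     } while (previous != 0u); // keep spinning while someone else holds it
  }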

 void mutex_unlock(ivec2 pos) {
     imageStore(image_lock, pos, uvec4(0));
 }

 out vec4 color;
 void main() {
     mutex_lock(ivec2(gl_FragCoord.xy));           
     uint count = imageLoad(overdraw_count, ivec2(gl_FragCoord.xy)).x + 1;
     imageStore(overdraw_count, ivec2(gl_FragCoord.xy), uvec4(count));
     mutex_unlock(ivec2(gl_FragCoord.xy));  
 }

Fragment_Shader.fs
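
The answer mentions imageAtomicAdd() as the built-in alternative to the mutex. As a rough CPU-side illustration of that alternative, here is a C++ sketch in which `std::atomic::fetch_add` plays the role of the atomic add; `OverdrawBuffer` and `touch` are hypothetical names, and the buffer is only a stand-in for the r32ui image.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for the r32ui overdraw_count image.
struct OverdrawBuffer {
    int width;
    int height;
    std::vector<std::atomic<std::uint32_t>> counts;

    OverdrawBuffer(int w, int h)
        : width(w), height(h), counts(static_cast<std::size_t>(w) * h) {
        for (auto& c : counts) c.store(0, std::memory_order_relaxed);
    }

    // One atomic read-modify-write replaces the read / add / store steps,
    // so no separate lock is needed. Returns the new count for the pixel.
    std::uint32_t touch(int x, int y) {
        assert(x >= 0 && x < width && y >= 0 && y < height);
        std::size_t index = static_cast<std::size_t>(y) * width + x;
        return counts[index].fetch_add(1, std::memory_order_relaxed) + 1;
    }
};
```

Each "fragment" would call `touch(x, y)` once; concurrent calls for the same pixel cannot lose an increment because the update is a single atomic operation.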

About Demo.

In the demo video you can see that we render many teapots with blending on. Pixels with higher intensity indicate a higher overdraw count.

on youtube

Note: On Android you can see this overdraw count in the debug GPU options.

source: Per Pixel Mutex

aloisdg
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/19943892) – Jesse Jun 06 '18 at 11:38
  • @JessedeBruijne included. – aloisdg Jun 06 '18 at 11:56