
I'm trying to understand how L1/L2 cache flushing works here. Suppose I have a compute shader like this one:

#version 450
// local_size and the struct definitions below are assumptions filled in
// from usage; GAUSS_SEIDEL_PASSES and `something` are placeholders from
// the original question and are left undefined here.
layout(local_size_x = 64) in;

struct Particle{
    vec3 position;
};

struct Constraint{
    uint particle_id_1;
    uint particle_id_2;
};

layout(std430, set = 0, binding = 2) buffer Particles{
    Particle particles[];
};

layout(std430, set = 0, binding = 4) buffer Constraints{
    Constraint constraints[];
};

void main(){
    const uint gID = gl_GlobalInvocationID.x;
    for (int pass = 0; pass < GAUSS_SEIDEL_PASSES; pass++){
        // first query the constraint, which contains particle_id_1 and particle_id_2
        const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES + pass];
        // read newest positions
        vec3 position1 = particles[c.particle_id_1].position;
        vec3 position2 = particles[c.particle_id_2].position;
        // modify position1 and position2
        position1 += something;
        position2 -= something;
        // update positions
        particles[c.particle_id_1].position = position1;
        particles[c.particle_id_2].position = position2;
        // in the next iteration, different constraints may use the updated positions
    }
}

From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position, I copy some of the data from L2 to L1 (or directly to a register). Then in position1 += something I modify L1 (or the register). Finally, in particles[c.particle_id_1].position = position1, I flush the data from L1 (or a register) back to L2, right? So if I then have a second compute shader that I want to run after this one, and that second shader will read the positions of particles, I do not need to synchronize Particles. It would be enough to just record an execution barrier, without a memory barrier:

void vkCmdPipelineBarrier(
    VkCommandBuffer                             commandBuffer,  
    VkPipelineStageFlags                        srcStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
    VkPipelineStageFlags                        dstStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
    VkDependencyFlags                           dependencyFlags, // here nothing
    uint32_t                                    memoryBarrierCount, // here 0
    const VkMemoryBarrier*                      pMemoryBarriers, // nullptr
    uint32_t                                    bufferMemoryBarrierCount, // 0
    const VkBufferMemoryBarrier*                pBufferMemoryBarriers,  // nullptr
    uint32_t                                    imageMemoryBarrierCount, // 0
    const VkImageMemoryBarrier*                 pImageMemoryBarriers);  // nullptr
alagris

2 Answers


Vulkan's memory model does not care about "caches" as caches. Its model is built on the notions of availability and visibility. A value produced by GPU command/stage A is "available" to GPU command/stage B if command/stage A has an execution dependency with command/stage B. A value produced by GPU command/stage A is "visible" to GPU command/stage B if command/stage A has a memory dependency with command/stage B with regard to the particular memory in question and the access modes with which A wrote it and B will access it.

If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.

The implementation of availability and visibility will involve clearing caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.

Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, values written by compute shader invocations before the barrier are available to compute shader invocations afterwards, but not visible to them. You need a memory dependency to establish visibility.
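
As a minimal sketch (assuming both dispatches run in the compute stage, and commandBuffer is the command buffer being recorded), the fix is to attach a global memory barrier to the same call:

VkMemoryBarrier memoryBarrier = {};
memoryBarrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
memoryBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // writes of the first dispatch
memoryBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // reads of the second dispatch

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // srcStageMask
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // dstStageMask
    0,                                    // dependencyFlags
    1, &memoryBarrier,                    // global memory barrier: makes the writes visible
    0, NULL,                              // no buffer memory barriers
    0, NULL);                             // no image memory barriers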


However, if you want a GPU-level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy with an L1/L2 split? Maybe some do, maybe some don't.

It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.
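
For reference, coherent is a qualifier placed on the buffer block declaration; a sketch based on the question's Particles block:

// coherent makes writes visible to other invocations within the same
// dispatch (together with appropriate shader-level memory barriers);
// it does not create visibility across separate vkCmdDispatch calls.
layout(std430, set = 0, binding = 2) coherent buffer Particles{
    Particle particles[];
};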

Nicol Bolas

Implementation-dependent. For all we know, a device might have no cache at all, or in the future it might be some quantum magic.

A shader assignment operation does not, by itself, imply anything about caches. There is no "L1" or "L2" mentioned anywhere in the Vulkan specification; as far as Vulkan is concerned, that concept does not exist.

Let's completely divorce ourselves from the cache stuff and all the mental baggage that comes with it.

What is important here is that when you read something, that thing needs to be "visible to" the reading agent (irrespective of what kind of device you use and whatever obscure memory architecture it might have). If it is not "visible to" the reader, you might be reading garbage.

When you write something, visibility does not happen automatically: the writes are not yet "visible to" anyone.

First, you put your writes into the src* part of a memory dependency (e.g. via a pipeline barrier). That will make your writes "available from" the first synchronization scope.

Then you put your reader into the dst* part, which will take all referenced writes that are "available from" and make them "visible to" the second synchronization scope.
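
A sketch of that src*/dst* pairing, scoped to just the particle buffer (particleBuffer is a hypothetical VkBuffer handle; both dispatches are assumed to be compute work on the same queue family):

VkBufferMemoryBarrier bufferBarrier = {};
bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
bufferBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // src*: writes become "available from"
bufferBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // dst*: ...and "visible to" the reads
bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufferBarrier.buffer = particleBuffer; // hypothetical handle backing Particles
bufferBarrier.offset = 0;
bufferBarrier.size = VK_WHOLE_SIZE;

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    0,
    0, NULL,            // no global memory barriers
    1, &bufferBarrier,  // the buffer-scoped memory dependency
    0, NULL);           // no image memory barriers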

If you really want to shoehorn this into a cache system concept, don't think of it as levels of cache. Think of it as separate caches. That something is already in some cache does not mean it is in the particular cache the consumer needs.

krOoze