If you want to take advantage of asynchronous GPU execution, the CPU should avoid stalling on GPU operations. So never wait on a fence for a batch that was just issued. The same goes for memory: never try to write to memory that is being read by a GPU operation you just submitted.
You should at least double-buffer things. If you are changing vertex data every frame, you should allocate sufficient memory to hold two copies of that data. There's no need to make multiple allocations, or even to make multiple VkBuffers (just make the allocation and buffers bigger, then select which region of storage to use when you're binding it). While one region of storage is being read by GPU commands, you write to the other.
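As a rough sketch of the "one bigger buffer, two regions" idea, the helpers below compute how large the buffer would need to be and where each frame's region starts. The names (`align_up`, `region_stride`, `total_size`, `region_offset`) and the alignment parameter are assumptions made for illustration. The alignment you actually need depends on how the buffer is used and on your device's limits.

```c
#include <assert.h>
#include <stdint.h>

#define REGION_COUNT 2u /* double buffering: two regions inside one buffer */

/* Round `size` up to the next multiple of `alignment` (must be a power of two). */
static uint64_t align_up(uint64_t size, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Distance in bytes between the starts of two consecutive regions. */
static uint64_t region_stride(uint64_t per_frame_size, uint64_t alignment)
{
    return align_up(per_frame_size, alignment);
}

/* Size to request for the single buffer and its backing allocation. */
static uint64_t total_size(uint64_t per_frame_size, uint64_t alignment)
{
    return REGION_COUNT * region_stride(per_frame_size, alignment);
}

/* Byte offset of the region that frame `frame_index` writes to and binds. */
static uint64_t region_offset(uint64_t frame_index, uint64_t per_frame_size,
                              uint64_t alignment)
{
    return (frame_index % REGION_COUNT) * region_stride(per_frame_size, alignment);
}
```

The value returned by `region_offset` is what you would add to the mapped pointer before the memcpy, and what you would pass as the offset when binding the buffer (for vertex data, the offsets array of vkCmdBindVertexBuffers).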
Each batch you submit reads from certain memory, so the fence for that batch will be set once the GPU has finished reading from that memory. If you want to write to that memory from the CPU, you cannot begin until the fence for the batch that reads it has been set.
But because you're double buffering like this, the fence for the memory you're about to write to is not the fence for the batch you submitted last frame. It's the fence for the batch you submitted the frame before that. Since it's been some time since the GPU received that operation, it is far less likely that the CPU will have to actually wait. That is, the fence should hopefully already be set.
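To make the "frame before that" bookkeeping concrete, here is a small sketch of the index arithmetic. It assumes you keep one fence per region and number frames from zero. `slot_for_frame`, `guarding_frame` and `NO_PREVIOUS_FRAME` are hypothetical names, not Vulkan API.

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel: no submitted batch has read this region yet, so nothing to wait for. */
#define NO_PREVIOUS_FRAME UINT64_MAX

/* Region (and per-region fence) that frame `frame_index` uses. */
static uint32_t slot_for_frame(uint64_t frame_index, uint32_t region_count)
{
    assert(region_count > 0); /* guards the modulo below */
    return (uint32_t)(frame_index % region_count);
}

/* Frame whose batch last read the region that `frame_index` is about to
 * overwrite. With two regions this is the frame before last, not the
 * previous frame. */
static uint64_t guarding_frame(uint64_t frame_index, uint32_t region_count)
{
    assert(region_count > 0);
    if (frame_index < region_count)
        return NO_PREVIOUS_FRAME;
    return frame_index - region_count;
}
```

With `region_count` set to 2, frame 5 writes to slot 1, and the fence it needs to see set is the one submitted with frame 3, which had a whole extra frame to complete.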
Now, you shouldn't do a literal vkWaitForFences on that fence. You should check to see if it is set (vkGetFenceStatus does this without blocking), and if it isn't, go do something else useful with your time. But if you have nothing else useful you could be doing, then waiting is probably OK (rather than sitting and spinning on a test).
Once the fence is set, you know that you can freely write to the memory.
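The check-first, wait-only-as-a-last-resort policy might be sketched as below. To keep the example runnable without a Vulkan device, the fence and the job queue are replaced by hypothetical callbacks: `is_signaled` plays the role of vkGetFenceStatus, and `block_until_signaled` the role of vkWaitForFences. `FrameSync`, `acquire_region` and the `Fake*` helpers are all made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the application's fence and job queue. */
typedef struct {
    bool (*is_signaled)(void *ctx);          /* non-blocking check */
    void (*block_until_signaled)(void *ctx); /* blocking wait */
    bool (*run_one_job)(void *ctx);          /* false when no useful work is left */
    void *ctx;
} FrameSync;

/* Returns once the region may be written. The result is true if the CPU had
 * to block, false if the fence was found set by polling alone. */
static bool acquire_region(const FrameSync *s)
{
    assert(s && s->is_signaled && s->block_until_signaled && s->run_one_job);
    while (!s->is_signaled(s->ctx)) {
        if (!s->run_one_job(s->ctx)) {
            /* Nothing else to do: a real wait beats spinning on the test. */
            s->block_until_signaled(s->ctx);
            return true;
        }
    }
    return false;
}

/* Fake fence and job queue, only here to exercise the policy. */
typedef struct {
    int polls_until_signaled; /* fence reads as set once this reaches zero */
    int jobs_left;
    int jobs_done;
    bool blocked;
} FakeState;

static bool fake_is_signaled(void *ctx)
{
    FakeState *st = ctx;
    if (st->polls_until_signaled > 0) {
        st->polls_until_signaled--;
        return false;
    }
    return true;
}

static void fake_block(void *ctx)
{
    FakeState *st = ctx;
    st->polls_until_signaled = 0;
    st->blocked = true;
}

static bool fake_run_one_job(void *ctx)
{
    FakeState *st = ctx;
    if (st->jobs_left == 0)
        return false;
    st->jobs_left--;
    st->jobs_done++;
    return true;
}

static FrameSync fake_sync(FakeState *st)
{
    FrameSync s = { fake_is_signaled, fake_block, fake_run_one_job, st };
    return s;
}
```

In a real renderer the three callbacks would wrap your own fence handle and whatever CPU-side work you can overlap (culling, audio, streaming), so the blocking path is only taken when that work runs out.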
How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass?
You know because the memory is coherent. That is what VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means in this context: host changes to device memory are visible to the GPU without needing explicit visibility operations, and vice-versa.
Well... almost.
If you want to avoid having to use any synchronization, you must call vkQueueSubmit for the reading batch after you have finished modifying the memory on the CPU. If those two things happen in the opposite order, then you'll need a memory barrier. For example, you could have some part of the batch wait on an event set by the host (through vkSetEvent), which tells the GPU when you've finished writing. That way, you could submit the batch before performing the memory write. But in this case, the vkCmdWaitEvents call should include a source stage mask of VK_PIPELINE_STAGE_HOST_BIT (since the host is what sets the event), and it should have a memory barrier whose source access mask includes VK_ACCESS_HOST_WRITE_BIT (since the host is what writes to the memory).
But in most cases, it's easier to just write to the memory before submitting the batch. That way, you avoid needing to use host/event synchronization.