
I have a vertex buffer (a VkBuffer backed by a VkDeviceMemory allocation) that is host visible and host coherent.

To write to the vertex buffer on the host side, I map the device memory, memcpy into it, and unmap it.

To read from it, I bind the vertex buffer in a command buffer while recording a render pass. These command buffers are submitted in a loop that acquires, submits and presents, to draw each frame.

Currently I write once to the vertex buffer at program startup.

The vertex buffer then remains the same during the loop.

I'd like to modify the vertex buffer between each frame from the host side.

What I'm not clear on is the best/right way to synchronize these host-side writes with the device-side reads. Currently I have a fence and a pair of semaphores for each frame allowed simultaneously in flight.

For each frame:

  1. I wait on the fence.

  2. I reset the fence.

  3. The acquire signals semaphore #1.

  4. The queue submit waits on semaphore #1 and signals semaphore #2 and signals the fence.

  5. The present waits on semaphore #2.
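
In code, the loop looks roughly like this (a sketch only; `inFlightFence`, `imageAvailableSemaphore`, `renderFinishedSemaphore` and the other names are placeholders, and per-frame indexing is omitted):

```c
vkWaitForFences(device, 1, &inFlightFence, VK_TRUE, UINT64_MAX);   // 1. wait on the fence
vkResetFences(device, 1, &inFlightFence);                          // 2. reset the fence

uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      imageAvailableSemaphore,                     // 3. acquire signals semaphore #1
                      VK_NULL_HANDLE, &imageIndex);

VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit = {
    .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount   = 1,
    .pWaitSemaphores      = &imageAvailableSemaphore,              // 4. submit waits on semaphore #1,
    .pWaitDstStageMask    = &waitStage,
    .commandBufferCount   = 1,
    .pCommandBuffers      = &commandBuffers[imageIndex],
    .signalSemaphoreCount = 1,
    .pSignalSemaphores    = &renderFinishedSemaphore,              //    signals semaphore #2,
};
vkQueueSubmit(graphicsQueue, 1, &submit, inFlightFence);           //    and signals the fence

VkPresentInfoKHR present = {
    .sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores    = &renderFinishedSemaphore,                // 5. present waits on semaphore #2
    .swapchainCount     = 1,
    .pSwapchains        = &swapchain,
    .pImageIndices      = &imageIndex,
};
vkQueuePresentKHR(presentQueue, &present);
```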

Where is the right place in this to put the host-side map/memcpy/unmap and how should I synchronize it properly with the device reads?

Andrew Tomazos
  • "*To write to the vertex buffer on the host side I map it, memcpy to it and unmap the device memory.*" Mapping and unmapping of a memory allocation should each happen *exactly once* in your application. If you're using mappable memory, map it after allocation and leave it mapped until you deallocate it. Also, if you're mapping "device memory", why do you need the other buffer at all? – Nicol Bolas Feb 12 '19 at 15:47
  • @NicolBolas: I'm talking about a VkDeviceMemory and a VkBuffer, and vkMapMemory and vkUnmapMemory. The VkBuffer is backed by the VkDeviceMemory. See "Filling the vertex buffer" here: https://vulkan-tutorial.com/Vertex_buffers/Vertex_buffer_creation – Andrew Tomazos Feb 12 '19 at 16:00
  • Yes, I know how the Vulkan API works. I'm asking you why you need to map and unmap memory every frame, rather than just leaving it mapped. "Because some tutorial did it" is *never* a good answer. "*The VkBuffer is backed by the VkDeviceMemory.*" OK, so you didn't mean device *local* memory. A (useable) `VkBuffer` is *always* backed by `VkDeviceMemory`; it's not really something you have to point out. – Nicol Bolas Feb 12 '19 at 16:04
  • @NicolBolas: So you are saying to map on program startup and unmap on program shutdown – and then just memcpy on each frame. Fine, agreed. Where do I put the memcpy in the loop, and how do I synchronize that write with the device-side read? – Andrew Tomazos Feb 12 '19 at 16:09
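
For reference, the map-once approach from the comments looks roughly like this, where `vertexBufferMemory`, `vertices` and `vertexDataSize` are placeholder names:

```c
// At startup, after allocating and binding vertexBufferMemory:
void* mappedVertexData = NULL;
vkMapMemory(device, vertexBufferMemory, 0, VK_WHOLE_SIZE, 0, &mappedVertexData);

// Each frame (once it is safe to write -- see the answer below):
memcpy(mappedVertexData, vertices, vertexDataSize);

// At shutdown, just before vkFreeMemory:
vkUnmapMemory(device, vertexBufferMemory);
```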

1 Answer


If you want to take advantage of asynchronous GPU execution, you want the CPU to avoid having to stall for GPU operations. So never wait on a fence for a batch that was just issued. The same thing goes for memory: you should never desire to write to memory which is being read by a GPU operation you just submitted.

You should at least double-buffer things. If you are changing vertex data every frame, you should allocate sufficient memory to hold two copies of that data. There's no need to make multiple allocations, or even to make multiple VkBuffers (just make the allocation and buffers bigger, then select which region of storage to use when you're binding it). While one region of storage is being read by GPU commands, you write to the other.
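
For example, a sketch of that layout (not code from the question; `vertexDataSize`, `frameIndex`, `mappedVertexData` and the other names are placeholders):

```c
// One VkBuffer big enough for two copies of the per-frame vertex data.
VkDeviceSize regionSize = vertexDataSize;                 // size of one copy
VkBufferCreateInfo bufferInfo = {
    .sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
    .size        = 2 * regionSize,
    .usage       = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT,
    .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
};
vkCreateBuffer(device, &bufferInfo, NULL, &vertexBuffer);
// ...allocate HOST_VISIBLE | HOST_COHERENT memory for it, bind it, map it once...

// Each frame, pick which half this frame uses (frameIndex cycles 0, 1, 0, 1, ...):
VkDeviceSize offset = frameIndex * regionSize;

// The host writes into that half (once it's safe -- see below)...
memcpy((char*)mappedVertexData + offset, vertices, (size_t)regionSize);

// ...and the command buffer recorded for this frame reads from the same half:
vkCmdBindVertexBuffers(commandBuffer, 0, 1, &vertexBuffer, &offset);
```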

Each batch you submit reads from certain memory. As such, the fence for that batch will be set when the GPU is finished reading from that memory. So if you want to write to that memory from the CPU, you cannot begin until the fence for the batch that reads from it has been set.

But because you're double buffering like this, the fence for the memory you're about to write to is not the fence for the batch you submitted last frame. It's the fence for the batch you submitted the frame before that. Since it's been some time since the GPU received that operation, it is far less likely that the CPU will have to actually wait. That is, the fence should hopefully already be set.

Now, you shouldn't do a literal vkWaitForFences on that fence. You should check to see if it is set, and if it isn't, go do something else useful with your time. But if you have nothing else useful you could be doing, then waiting is probably OK (rather than sitting and spinning on a test).

Once the fence is set, you know that you can freely write to the memory.
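
Putting that together, the host side of each frame might look roughly like this (again just a sketch with the placeholder names from above, assuming one fence per frame in flight as in the question):

```c
// fences[frameIndex] was last signalled by the submit made MAX_FRAMES_IN_FLIGHT
// frames ago -- the last batch that read from region frameIndex of the buffer.

// Prefer polling to blocking if there is other useful work to do:
if (vkGetFenceStatus(device, fences[frameIndex]) == VK_NOT_READY) {
    // ...do something else useful, or fall back to blocking:
    vkWaitForFences(device, 1, &fences[frameIndex], VK_TRUE, UINT64_MAX);
}
vkResetFences(device, 1, &fences[frameIndex]);

// Safe now: no pending GPU work reads this region any more.
memcpy((char*)mappedVertexData + frameIndex * regionSize, vertices, (size_t)regionSize);

// Then acquire, record/bind with that region's offset, submit
// (signalling fences[frameIndex]), and present as before.
```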


How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass?

You know because the memory is coherent. That is what VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means in this context: host changes to device memory are visible to the GPU without needing explicit visibility operations, and vice-versa.

Well... almost.

If you want to avoid having to use any synchronization, you must call vkQueueSubmit for the reading batch after you have finished modifying the memory on the CPU. If they get called in the wrong order, then you'll need a memory barrier. For example, you could have some part of the batch wait on an event set by the host (through vkSetEvent), which tells the GPU when you've finished writing. And therefore, you could submit that batch before performing the memory writing. But in this case, the vkCmdWaitEvents call should include a source stage mask of HOST (since that's who's setting the event), and it should have a memory barrier whose source access flag also includes HOST_WRITE (since that's who's writing to the memory).
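
That event-based variant might look roughly like this (a sketch; `hostWriteEvent` stands for a `VkEvent` you created earlier with `vkCreateEvent`):

```c
// Recorded into the batch, before the commands that read the vertex buffer:
VkMemoryBarrier hostWriteBarrier = {
    .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_HOST_WRITE_BIT,               // the host is the writer
    .dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT,    // vertex fetch is the reader
};
vkCmdWaitEvents(commandBuffer, 1, &hostWriteEvent,
                VK_PIPELINE_STAGE_HOST_BIT,                  // source stage: the host sets the event
                VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,          // destination stage: vertex input
                1, &hostWriteBarrier,
                0, NULL, 0, NULL);

// On the CPU, after the memcpy has finished:
vkSetEvent(device, hostWriteEvent);
```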

But in most cases, it's easier to just write to the memory before submitting the batch. That way, you avoid needing to use host/event synchronization.

Nicol Bolas
  • I agree that if I wait for the fence signaled by queue submit before issuing the memcpy, the device will have finished reading before the host writes. What isn't clear is the other side. How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass? When the memcpy returns, it isn't immediately available on the device, correct? Don't I have to synchronize the write completing with the read commencing, i.e. sequence these somehow? – Andrew Tomazos Feb 12 '19 at 16:35