Question #1: Is that assumption correct?
Probably not, but whether the alternative is available depends on your platform. For GPU->CPU transfers there are really three options:
1. HOST_VISIBLE | HOST_COHERENT
This type is visible to the host and guaranteed to be coherent, but not cached on the host. CPU reads will be very slow, but that might be OK if you are only reading back a small amount of data (it might be cheaper than issuing vkInvalidateMappedMemoryRanges(), and there is little point streaming data into the CPU cache if you never expect to touch it again on the CPU).
2. HOST_VISIBLE | HOST_CACHED
This type is visible to the host and cached, but not guaranteed to be coherent (the CPU and GPU might see different things at the same address if you don't manually enforce coherency). For this type of memory you must use vkInvalidateMappedMemoryRanges() after GPU writes and before CPU reads (or vkFlushMappedMemoryRanges() for the other direction) to ensure that one processor can see what the other wrote; otherwise you might read stale data.
Data access will be fast once the data is in the cache, and you can benefit from CPU-side data fetch tricks such as explicit preloads and cache prefetching, but you will pay an overhead for the invalidate operation.
3. HOST_VISIBLE | HOST_CACHED | HOST_COHERENT
Finally you have the host cached AND coherent memory type, which sort of gives you the best of both worlds if you have high-bandwidth reads to make on the CPU. Hardware provides the coherency implementation automatically, so there is no need to invalidate, BUT it's not guaranteed to be available on all platforms. For bulk data reads on the CPU I would expect this to be the most efficient type in cases where it is available.
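To make the trade-off concrete, here is a minimal sketch of how you might pick a memory type for a readback buffer, preferring the cached options described above. The helper names and the preference order are my own assumptions, not part of the Vulkan API; memoryTypeBits would come from vkGetBufferMemoryRequirements() for your buffer.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

/* Hypothetical helper: find a memory type allowed by the buffer's
 * memoryTypeBits that has all of the requested property flags.
 * Returns -1 if no such type exists on this device. */
static int32_t find_memory_type(VkPhysicalDevice phys, uint32_t memoryTypeBits,
                                VkMemoryPropertyFlags wanted)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
        if ((memoryTypeBits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
            return (int32_t)i;
    }
    return -1;
}

/* Hypothetical preference order for bulk GPU->CPU readback. */
static int32_t pick_readback_type(VkPhysicalDevice phys, uint32_t memoryTypeBits)
{
    /* Option 3: cached AND coherent -- fast reads, no manual invalidate. */
    int32_t idx = find_memory_type(phys, memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT |
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
    if (idx >= 0) return idx;

    /* Option 2: cached but not coherent -- remember to invalidate before reading. */
    idx = find_memory_type(phys, memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT);
    if (idx >= 0) return idx;

    /* Option 1: coherent but uncached -- fine for small, one-shot reads. */
    return find_memory_type(phys, memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
}
```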
It's worth noting that there is no single "best" memory setting for all allocations. Do not use host cached or host coherent memory for things you never expect to transfer back to the CPU (memory coherency isn't free in terms of power or memory performance).
Question #2: What is happening under the hood during vkInvalidateMappedMemoryRanges? Is this just a memcpy from some internal cache or can this be a longer procedure?
In the case where you have non-coherent memory, it does whatever is needed to make the specified mapped ranges coherent. Typically this means invalidating (discarding) lines in the CPU cache which may contain stale copies of the data, ensuring that subsequent reads by the CPU see the version that the GPU actually wrote.
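As a rough sketch of what the non-coherent path looks like in practice (the function and variable names here are hypothetical; only the Vulkan entry points are real), the usual pattern is: wait for the GPU work to finish, invalidate the mapped range, then read.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical readback from a HOST_VISIBLE | HOST_CACHED (non-coherent)
 * allocation that the GPU has written to, with completion signalled by 'fence'.
 * Error handling is omitted for brevity. */
static void read_back(VkDevice device, VkDeviceMemory memory, VkFence fence,
                      void *dst, size_t size)
{
    /* Make sure the GPU has actually finished writing first. */
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

    void *mapped = NULL;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);

    /* Discard any stale CPU cache lines covering this range. The offset must
     * be a multiple of VkPhysicalDeviceLimits::nonCoherentAtomSize;
     * VK_WHOLE_SIZE satisfies the size side of that rule. */
    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = 0,
        .size   = VK_WHOLE_SIZE,
    };
    vkInvalidateMappedMemoryRanges(device, 1, &range);

    /* CPU reads now see what the GPU wrote. */
    memcpy(dst, mapped, size);
    vkUnmapMemory(device, memory);
}
```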
Question #3: If this could take longer (i.e. it is not a simple memcpy), then I probably should have some possibility to synchronize with the completion of it, right?
No. Invalidation is a CPU-side operation; the call is synchronous, so it consumes CPU time and the CPU is busy until it returns. In general, though, you can avoid the need for it altogether by using coherent memory.