Question #1: Is that assumption correct?
Probably not, but whether the alternative is available depends on your platform. For GPU->CPU transfers there are really three options:
1. HOST_VISIBLE | HOST_COHERENT
This type is visible to the host and guaranteed to be coherent, but not cached on the host. CPU reads will be very slow, but that might be OK if you are only reading back a small amount of data (it might be cheaper than issuing vkInvalidateMappedMemoryRanges(), and there is little point streaming data into the CPU cache if you never expect to touch it again on the CPU).
2. HOST_VISIBLE | HOST_CACHED
This type is visible to the host and cached, but not guaranteed to be coherent (the CPU and GPU might see different things at the same address if you don't manually enforce coherency). For this type of memory you must use vkInvalidateMappedMemoryRanges() after GPU writes and before CPU reads (or vkFlushMappedMemoryRanges() for the other direction) to ensure that one processor can see what the other wrote; otherwise you might read stale data.
Data access will be fast once the data is in the cache, and you can benefit from CPU-side data fetch tricks such as explicit preloads and cache prefetching, but you will pay an overhead for the invalidate operation.
3. HOST_VISIBLE | HOST_CACHED | HOST_COHERENT
Finally you have the host cached AND coherent memory type, which sort of gives you the best of both worlds if you have high-bandwidth reads to make on the CPU. Hardware provides the coherency implementation automatically, so there is no need to invalidate, BUT it's not guaranteed to be available on all platforms. For bulk data reads on the CPU I would expect this to be the most efficient type in cases where it is available.
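To make the trade-off concrete, here is a minimal sketch of how you might pick a memory type for a readback buffer, preferring the cached options described above. The helper names and the preference order are my own assumptions, not part of the Vulkan API; memoryTypeBits would come from vkGetBufferMemoryRequirements() for your buffer.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

/* Hypothetical helper: find a memory type allowed by the buffer's
 * memoryTypeBits that has all of the requested property flags.
 * Returns -1 if no such type exists on this device. */
static int32_t find_memory_type(VkPhysicalDevice phys, uint32_t memoryTypeBits,
                                VkMemoryPropertyFlags wanted)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
        if ((memoryTypeBits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
            return (int32_t)i;
    }
    return -1;
}

/* Hypothetical preference order for bulk GPU->CPU readback. */
static int32_t pick_readback_type(VkPhysicalDevice phys, uint32_t memoryTypeBits)
{
    /* Option 3: cached AND coherent -- fast reads, no manual invalidate. */
    int32_t idx = find_memory_type(phys, memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT |
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
    if (idx >= 0) return idx;

    /* Option 2: cached but not coherent -- remember to invalidate before reading. */
    idx = find_memory_type(phys, memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT);
    if (idx >= 0) return idx;

    /* Option 1: coherent but uncached -- fine for small, one-shot reads. */
    return find_memory_type(phys, memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
}
```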
It's worth noting that there is no single "best" memory setting for all allocations. Do not use host cached or host coherent memory for things you never expect to transfer back to the CPU (memory coherency isn't free in terms of power or memory performance).
Question #2: What is happening under the hood during vkInvalidateMappedMemoryRanges? Is this just a memcpy from some internal cache or can this be a longer procedure?
In the case where you have non-coherent memory, it does whatever is needed to make the specified mapped ranges coherent. Typically this means invalidating (discarding) lines in the CPU cache which may contain stale copies of the data, ensuring that subsequent reads by the CPU see the version that the GPU actually wrote.
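As a rough sketch of what the non-coherent path looks like in practice (the function and variable names here are hypothetical; only the Vulkan entry points are real), the usual pattern is: wait for the GPU work to finish, invalidate the mapped range, then read.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical readback from a HOST_VISIBLE | HOST_CACHED (non-coherent)
 * allocation that the GPU has written to, with completion signalled by 'fence'.
 * Error handling is omitted for brevity. */
static void read_back(VkDevice device, VkDeviceMemory memory, VkFence fence,
                      void *dst, size_t size)
{
    /* Make sure the GPU has actually finished writing first. */
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

    void *mapped = NULL;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);

    /* Discard any stale CPU cache lines covering this range. The offset must
     * be a multiple of VkPhysicalDeviceLimits::nonCoherentAtomSize;
     * VK_WHOLE_SIZE satisfies the size side of that rule. */
    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = 0,
        .size   = VK_WHOLE_SIZE,
    };
    vkInvalidateMappedMemoryRanges(device, 1, &range);

    /* CPU reads now see what the GPU wrote. */
    memcpy(dst, mapped, size);
    vkUnmapMemory(device, memory);
}
```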
Question #3: If this could take longer (i.e. it is not a simple memcpy), then I probably should have some possibility to synchronize with the completion of it, right?
No. Invalidation is a CPU-side operation; the call is synchronous, so it consumes CPU time and the CPU is busy until it returns. In general, though, you can avoid the need for it altogether by using coherent memory.