
I have read a lot of articles on Vulkan memory management, and all of them recommend using staging buffers for transferring data to the GPU. But we can already create memory that is device-local as well as host-visible and host-coherent: readable from the GPU and writable from the CPU.

Here is what I thought would be reasonable: create one big buffer bound to memory that is device-local, host-visible, and host-coherent. For all dynamic buffers, we then just keep suballocating from this buffer and bind it at the appropriate offset.

But in most examples, they create one host-visible buffer and one device-local buffer and use vkCmdCopyBuffer to transfer between them. What is the advantage of this over using just one buffer that is accessible to both the CPU and the GPU? I am only asking about buffers, not textures.
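
For concreteness, here is a minimal sketch of what I have in mind, in C against the Vulkan 1.0 API. `physicalDevice` and `device` are assumed to already exist, the pool size is an arbitrary example, and error handling is omitted:

    #include <vulkan/vulkan.h>
    #include <stdint.h>

    /* Find a memory type that is simultaneously device-local,
     * host-visible, and host-coherent. */
    static uint32_t findMemoryType(VkPhysicalDevice physicalDevice,
                                   uint32_t typeBits,
                                   VkMemoryPropertyFlags wanted) {
        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);
        for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
            if ((typeBits & (1u << i)) &&
                (props.memoryTypes[i].propertyFlags & wanted) == wanted)
                return i;
        }
        return UINT32_MAX; /* no such memory type on this device */
    }

    /* Create one big pool buffer bound to device-local + host-visible +
     * host-coherent memory and map it persistently. Dynamic data would be
     * suballocated at offsets within it. Returns VK_NULL_HANDLE if no
     * suitable memory type exists. */
    static VkBuffer createPoolBuffer(VkPhysicalDevice physicalDevice,
                                     VkDevice device, void** outMapped) {
        VkBufferCreateInfo bufInfo = {
            .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
            .size  = 64ull * 1024 * 1024, /* 64 MB, arbitrary example */
            .usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT |
                     VK_BUFFER_USAGE_VERTEX_BUFFER_BIT,
            .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
        };
        VkBuffer pool;
        vkCreateBuffer(device, &bufInfo, NULL, &pool);

        VkMemoryRequirements req;
        vkGetBufferMemoryRequirements(device, pool, &req);

        uint32_t type = findMemoryType(physicalDevice, req.memoryTypeBits,
                                       VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
                                       VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                                       VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
        if (type == UINT32_MAX) {
            vkDestroyBuffer(device, pool, NULL);
            return VK_NULL_HANDLE;
        }

        VkMemoryAllocateInfo allocInfo = {
            .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
            .allocationSize  = req.size,
            .memoryTypeIndex = type,
        };
        VkDeviceMemory mem;
        vkAllocateMemory(device, &allocInfo, NULL, &mem);
        vkBindBufferMemory(device, pool, mem, 0);

        /* Map once and keep it mapped: the CPU writes at an offset, and
         * the GPU reads the same buffer at that offset via bind or
         * descriptor offsets. */
        vkMapMemory(device, mem, 0, VK_WHOLE_SIZE, 0, outMapped);
        return pool;
    }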

2 Answers


As ratchet freak said, devices aren't required to have a memory type that is both device-local and host-visible. Even on devices that do, the size might be limited. Maybe things have changed in the last few years, but it used to be that PCI-E and BIOS limitations meant 256 MB or maybe 512 MB was as much as you could get. And finally, CPU writes over PCI-E have lower bandwidth than writes to the CPU's own memory. So even though using a staging buffer consumes twice as much total bandwidth, if the copy can be done asynchronously on a transfer queue, it minimizes the time the CPU and the graphics pipeline spend on the transfer. Whether a staging buffer is a net win therefore depends on the specific CPU and GPU combination, and on what your application is doing.
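
For the asynchronous-transfer option, here is a rough sketch of locating a dedicated transfer queue family, i.e. one that advertises transfer but not graphics and is therefore likely to map to a DMA engine. The fixed-size local array is only to keep the sketch short:

    #include <vulkan/vulkan.h>
    #include <stdint.h>

    /* Look for a queue family with transfer capability but no graphics
     * capability; such families usually correspond to a DMA engine that
     * can run copies alongside graphics work. */
    static uint32_t findTransferOnlyQueueFamily(VkPhysicalDevice physicalDevice) {
        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, NULL);

        VkQueueFamilyProperties families[16]; /* plenty for current hardware */
        if (count > 16) count = 16;
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families);

        for (uint32_t i = 0; i < count; ++i) {
            VkQueueFlags f = families[i].queueFlags;
            if ((f & VK_QUEUE_TRANSFER_BIT) && !(f & VK_QUEUE_GRAPHICS_BIT))
                return i;
        }
        return UINT32_MAX; /* none found: fall back to the graphics queue */
    }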

However, on SoCs like mobile devices, and on integrated GPUs, using a staging buffer should seldom if ever be a win. Mobile GPUs shouldn't have limited device-local + host-visible heap sizes. Looking at a couple of Windows integrated GPUs on vulkan.gpuinfo.org, modern Intel integrated GPUs don't appear to have such limits either, but AMD integrated GPUs still do (I only looked at a few random samples, YMMV).

All this makes it hard to give a clear "always do X" recommendation. Personally, I would generally do this:

  • If I just want one code path that works everywhere and am not worried about performance or memory footprint, I use a staging buffer. This is probably a good choice for discrete GPUs, but suboptimal for integrated/SOC GPUs.
  • Otherwise, keep the staging buffer as a fallback path, but use a shared device-local/host-visible pool when a big enough one is available (see the sketch after this list).
  • When I start trying to get every last bit of performance, I tune the above to prefer staging buffers with asynchronous transfers for some kinds of uploads on discrete GPUs, when I have data showing it's a net win.
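
A rough sketch of the detection behind the second bullet; the 256 MB in the usage comment is an illustrative threshold, not a number from the spec:

    #include <vulkan/vulkan.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Decide at startup whether a shared device-local + host-visible pool
     * is available and big enough; otherwise use the staging-buffer path. */
    static bool useSharedPool(VkPhysicalDevice physicalDevice,
                              VkDeviceSize minPoolSize) {
        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

        const VkMemoryPropertyFlags wanted =
            VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

        for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
            if ((props.memoryTypes[i].propertyFlags & wanted) != wanted)
                continue;
            /* The backing heap must hold the pool plus whatever else
             * we plan to put in it. */
            VkDeviceSize heapSize =
                props.memoryHeaps[props.memoryTypes[i].heapIndex].size;
            if (heapSize >= minPoolSize)
                return true;
        }
        return false; /* no suitable type: fall back to staging buffers */
    }

    /* Usage: bool shared = useSharedPool(physicalDevice, 256ull << 20); */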
Jesse Hall
  • So what about dynamic buffers? Do we have to go through the copy operation every frame? I would say the DMA engine won't be as fast as just having access to that 256 MB of device-local, host-visible memory? –  Jul 08 '17 at 17:03
  • 1
    Right, it won't be as fast. But if you care about that, then you're in the second category I mentioned, where you use the shared memory if you can. – Jesse Hall Jul 08 '17 at 18:09

Create one big buffer bound to memory that is device-local, host-visible, and host-coherent.

Not all devices have such a memory heap, so if you want to be portable you need to account for that.

If the memory you want (or have) to put a buffer in is not host-visible, there is no option other than a staging buffer.
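
A minimal sketch of that staging path. Names are illustrative: `staging` is assumed to be a host-visible buffer the CPU has already filled, `deviceLocal` the device-local destination, and `cmd` a command buffer in the recording state; submission and fencing are omitted:

    #include <vulkan/vulkan.h>

    /* Record a copy from the staging buffer into the device-local buffer,
     * then make the result visible to vertex-shader reads. */
    static void recordStagingCopy(VkCommandBuffer cmd, VkBuffer staging,
                                  VkBuffer deviceLocal, VkDeviceSize dataSize) {
        VkBufferCopy region = {
            .srcOffset = 0,
            .dstOffset = 0,
            .size      = dataSize,
        };
        vkCmdCopyBuffer(cmd, staging, deviceLocal, 1, &region);

        /* Barrier: transfer write must complete before shader reads. */
        VkBufferMemoryBarrier barrier = {
            .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
            .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
            .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
            .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
            .buffer = deviceLocal,
            .offset = 0,
            .size   = VK_WHOLE_SIZE,
        };
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,
                             0, 0, NULL, 1, &barrier, 0, NULL);
    }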

ratchet freak