6

How should the Vulkan API's staging buffer be used properly? Saving your vertex data into a staging buffer, then copying it to the vertex buffer on the GPU, seems like a longer path than just writing your vertices to the vertex buffer directly. This is a Minecraft-clone program, so there will be a lot of vertex data (with index data too) and dynamic chunk loading — is there another kind of buffer, or a method of buffering, that would benefit this use case?

Even then, using separate threading on the device, or even cross-device threading, seems slower than just writing the vertices to the vertex buffer on the fly. And I currently do not clearly understand the pros and cons of the traditional direct vertex buffer versus the staging buffer.

The tutorial I'm currently following uses a staging buffer once, before drawing and presentation. There seems to be a lack of forums or articles discussing precisely the problem described above.

entropy32
  • 169
  • 3
  • 11
  • "*just directly submitting your vertices to the vertex buffer on the fly.*" What exactly do you mean by that? You can't "submit vertices" to a buffer from the CPU unless that buffer is in memory *writable by* the CPU. And if it is CPU writable... why are you using a staging buffer? – Nicol Bolas Nov 18 '20 at 05:57
  • I don't really know the correct word to use for "submitting" the vertices to the device. But I'm saying that when you record the command buffer, you pass the vertex buffer to `vkCmdBindVertexBuffers`, and it gets run on every frame?? – entropy32 Nov 18 '20 at 19:07

2 Answers

7

The exact mechanics one would use to achieve high performance would depend heavily on both the details of the hardware and the expected frequency of data updates.

Staging buffers are only relevant for GPUs that have multiple pools of device memory, called heaps. Integrated GPUs typically only have one heap, so there's no point in staging vertex data (textures still need staging because of tiling).

So on a device with more than one heap, the first thing you need to find out is what your options are. On multi-memory GPUs (aka: GPUs that have their own memory), one or more of the heaps will be marked DEVICE_LOCAL. This is meant to represent memory which has the fastest possible access time for GPU operations.

But device-local memory is (usually) not directly accessible from the CPU; hence the need for staging.

However, memory that isn't device-local may still be usable for GPU tasks. That is, a GPU may be able to read directly from CPU-accessible memory for certain kinds of operations. You can ask whether a particular memory type can be used as the source memory for vertex data.
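As a sketch of that query, here is the usual memory-type selection loop. The flag constants and `MemoryType` struct below are mocked stand-ins so the snippet runs standalone; real code would use `VK_MEMORY_PROPERTY_*` and `VkPhysicalDeviceMemoryProperties` from `<vulkan/vulkan.h>`:

```cpp
#include <cstdint>
#include <vector>

// Mocked stand-ins for the Vulkan property flag bits (illustrative values):
constexpr uint32_t DEVICE_LOCAL_BIT  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t HOST_VISIBLE_BIT  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t HOST_COHERENT_BIT = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

struct MemoryType { uint32_t propertyFlags; }; // mirrors VkMemoryType

// The same loop you'd run over VkPhysicalDeviceMemoryProperties::memoryTypes.
// typeFilter comes from VkMemoryRequirements::memoryTypeBits for the buffer
// you created; required is the property set you want.
int findMemoryType(const std::vector<MemoryType>& types,
                   uint32_t typeFilter, uint32_t required) {
    for (uint32_t i = 0; i < types.size(); ++i) {
        if ((typeFilter & (1u << i)) &&
            (types[i].propertyFlags & required) == required)
            return static_cast<int>(i);
    }
    return -1; // no memory type satisfies the request
}
```

If this search succeeds with `HOST_VISIBLE_BIT` for a buffer created with `VK_BUFFER_USAGE_VERTEX_BUFFER_BIT`, you can map and fill that buffer directly, with no staging step at all.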

If CPU-accessible, non-device-local memory can be used for vertex data, then you now have a real choice: which heap to read from? Do you read vertex data across the PCI-e bus? Or do you transfer the vertex data to faster memory, then read from it?

In cases where vertex data is being dynamically generated every frame, I would probably say that you should default to not staging (but you should still profile it on applicable hardware).

And if you were doing some kind of data streaming, where you're loading world data as the camera moves through a scene, then I would say that the best thing to do would be to transfer it to device-local memory. You're using the data much more frequently than you're doing transfer operations, and you should be able to dump whole chunks of data in a few transfer calls.

But your use case is about intermittent data generation. Even if a player is constantly placing blocks, you're likely to be reusing the same data to render several frames. And even when a block is placed, you're only changing that one subsection of data. Each transfer operation has some degree of overhead to it, so doing a bunch of small transfers can make things sluggish.
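One common way to blunt that per-transfer overhead is to coalesce the dirty ranges before recording the copy, so one `vkCmdCopyBuffer` call with several regions replaces a pile of tiny transfers. This is a hypothetical helper, not anything from the Vulkan API; `CopyRegion` plays the role of a `VkBufferCopy` with equal src/dst offsets:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct CopyRegion { uint64_t offset, size; }; // stand-in for one VkBufferCopy

// Merge overlapping/adjacent dirty byte ranges (e.g. one per edited block)
// into the minimal set of copy regions.
std::vector<CopyRegion> coalesce(std::vector<CopyRegion> dirty) {
    std::sort(dirty.begin(), dirty.end(),
              [](const CopyRegion& a, const CopyRegion& b) {
                  return a.offset < b.offset;
              });
    std::vector<CopyRegion> merged;
    for (const CopyRegion& r : dirty) {
        if (!merged.empty() &&
            r.offset <= merged.back().offset + merged.back().size) {
            // Overlaps or touches the previous region: extend it.
            uint64_t end = std::max(merged.back().offset + merged.back().size,
                                    r.offset + r.size);
            merged.back().size = end - merged.back().offset;
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}
```

The merged list would then be handed to a single `vkCmdCopyBuffer(cmd, staging, vertexBuffer, regionCount, regions)` call.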

As such, it's hard to say which is better; you should profile it on a variety of hardware to see what the performance is like.

Also, be advised that some discrete GPUs will have a special heap of around 256MB. It's special because it is both CPU-accessible and device-local. Presumably, there is some fast channel for the CPU to write its data to this device memory. This memory heap is designed for streaming usage, so it would be pretty good for your needs (assuming its size is adequate; the size tends to be around 256MB regardless of the GPU's total memory).
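Putting the choices above together, a streaming vertex buffer might probe memory types in a preference order like the following sketch. The fallback order and the mocked flag bits are my assumptions, not anything mandated by Vulkan:

```cpp
#include <cstdint>
#include <vector>

// Mocked flag bits standing in for VK_MEMORY_PROPERTY_* so this runs alone.
constexpr uint32_t DL = 0x1; // DEVICE_LOCAL
constexpr uint32_t HV = 0x2; // HOST_VISIBLE

struct TypeInfo { uint32_t flags; }; // mirrors VkMemoryType::propertyFlags

// Hypothetical preference order for streamed vertex data:
// 1) the special DEVICE_LOCAL + HOST_VISIBLE type (the ~256MB heap above),
// 2) plain HOST_VISIBLE (write directly, GPU reads across PCI-e),
// 3) plain DEVICE_LOCAL (fall back to an explicit staging copy).
int pickStreamingType(const std::vector<TypeInfo>& types) {
    for (uint32_t want : {DL | HV, HV, DL})
        for (uint32_t i = 0; i < types.size(); ++i)
            if ((types[i].flags & want) == want)
                return static_cast<int>(i);
    return -1; // nothing usable (shouldn't happen on a conformant device)
}
```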

Nicol Bolas
  • 449,505
  • 63
  • 781
  • 982
  • To be more precise, I mean like there won't necessarily be a data generation between each frame, but more so conditionally with varying size. Like when player's are moving thus new chunk loading and big chunk of data generation or when the player's destroy or place a single block thus a small new data generation. So I was thinking is it possible to use stage buffer *and* use CPU accessible buffer conditionally. This looks to be really unnecessarily complicated though. I am running on a discrete NVidia GPU if that helps. – entropy32 Nov 18 '20 at 19:23
  • 1
    I believe you are using the word "pool" instead of "heap", which is the term used in the Vulkan spec, am I right? You mentioned that AMD HW tends to have a device-local, host-visible "pool". Apparently Nvidia has it too: http://www.vulkan.gpuinfo.org/displayreport.php?id=16387#memory – tuket Sep 08 '22 at 19:15
  • 1
    @tuket: Good point; I adjusted the wording. – Nicol Bolas Sep 08 '22 at 21:56
1

In short and simple:

  1. Staging is just a way to copy data from CPU to GPU memory

  2. If data is in CPU memory, it is read over PCI-e and cached by GPU at access time

  3. PCI-e has relatively limited throughput and some latency

  4. Mindlessly copying the whole thing every frame is useless and just adds overhead.
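To put points 3 and 4 in numbers, here is a back-of-envelope check. Every figure is an illustrative assumption (face count, vertex size, chunk count, frame rate), not a measurement:

```cpp
#include <cstdint>

// Assume a 16x16x256 chunk meshing to ~4096 visible faces, each face being
// 4 vertices of 32 bytes plus 6 uint32 indices, re-uploaded for 400 loaded
// chunks at 60 FPS.
constexpr uint64_t facesPerChunk  = 4096;
constexpr uint64_t bytesPerFace   = 4 * 32 + 6 * 4;  // vertices + indices
constexpr uint64_t bytesPerChunk  = facesPerChunk * bytesPerFace;
constexpr uint64_t loadedChunks   = 400;
constexpr uint64_t fps            = 60;
constexpr uint64_t bytesPerSecond = bytesPerChunk * loadedChunks * fps;
// ~14.9 GB/s -- roughly the entire practical bandwidth of a PCIe 3.0 x16
// link, which is why re-copying everything per frame is a non-starter.
```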

So:

  1. If memory is often changed by CPU (like camera matrices, player position) - then let it be in CPU memory

  2. If the data is rarely modified, or only small parts of it change - then stage it to GPU memory

In your case:

  1. The buffer with voxel data is modified at chunk loading time and when the player modifies the world. The first case is rare - use an asynchronous copy when it happens. The second case is small and rare - if you record your command buffers every frame, you can even sneak that copy in there...
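The "sneak the copy in" idea can be sketched with a per-chunk dirty flag: while recording each frame's command buffer, you emit a copy (and the matching barrier) only for chunks that changed. The `ChunkGpuData` record and `recordFrame` helper below are hypothetical; the real Vulkan calls appear only as comments:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-chunk bookkeeping: dirty means the staging copy of this
// chunk's mesh is newer than what the GPU-side vertex buffer holds.
struct ChunkGpuData {
    bool dirty = true; // freshly loaded, or just edited by the player
};

// Returns the indices of chunks for which a copy was recorded this frame
// (a stand-in for the actual vkCmdCopyBuffer / vkCmdPipelineBarrier calls).
std::vector<size_t> recordFrame(std::vector<ChunkGpuData>& chunks) {
    std::vector<size_t> copied;
    for (size_t i = 0; i < chunks.size(); ++i) {
        if (chunks[i].dirty) {
            // here: vkCmdCopyBuffer(cmd, staging[i], vertexBuf[i], ...);
            // then a TRANSFER_WRITE -> VERTEX_ATTRIBUTE_READ memory barrier
            copied.push_back(i);
            chunks[i].dirty = false;
        }
        // here: vkCmdBindVertexBuffers + vkCmdDrawIndexed for chunk i
    }
    return copied;
}
```

In the steady state nothing is dirty, so the recorded command buffer contains only draws - the copies exist only on the frames where the world actually changed.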

Nicol Bolas has a more detailed answer.

By the way, if this is a Minecraft clone, why use a lot of indices? Just record a few shapes for the different blocks, then use instancing to store world positions and texture IDs for those. Use one draw command per distinct shape that you draw. Instancing is trivial to implement in Vulkan. You can even use compute shaders to generate world data almost instantly!
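To make the memory argument concrete, compare per-block meshes against one shared mesh plus per-instance data. All sizes here are illustrative assumptions (32-byte vertices, 24 unique cube vertices for per-face normals/UVs, 36 indices, 16 bytes of instance data):

```cpp
#include <cstdint>

// Naive: every block carries its own copy of the cube geometry.
constexpr uint64_t vertexSize   = 32;  // position + normal + UV, say
constexpr uint64_t cubeVertices = 24;  // 4 per face, for hard normals/UVs
constexpr uint64_t cubeIndices  = 36;  // 6 faces * 2 triangles * 3
constexpr uint64_t perBlockMesh = cubeVertices * vertexSize
                                + cubeIndices * 4;   // uint32 indices

// Instanced: one shared cube mesh, plus a small per-instance record
// (e.g. 12 bytes of world position + 4 bytes of texture/block ID).
constexpr uint64_t perInstance  = 16;
constexpr uint64_t sharedMesh   = perBlockMesh;      // paid once, not per block

constexpr uint64_t blocks         = 100000;
constexpr uint64_t naiveBytes     = blocks * perBlockMesh;
constexpr uint64_t instancedBytes = sharedMesh + blocks * perInstance;
```

Under these assumptions the instanced layout is over 50x smaller, before any hidden-face culling - though, as the comments below note, per-face lighting data can force you back toward individual vertices.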

user369070
  • 600
  • 1
  • 5
  • 9
  • I used indices because I thought I could save some vertex buffer size using the smaller-sized indices. A quad requires two triangles: so there would be 6 vertices, taking 6 times whatever the vertex size is (likely larger than the size of each index). So I thought using indices could save 2 vertices of size n at the cost of a significantly smaller index size (u32). There will be a lot of vertices, so I thought this might help a lot with memory usage. – entropy32 Nov 18 '20 at 19:12
  • What I meant is - you said there will be **a lot** of vertex and index data, while using instancing you could store only one cube of vertices and indices (8v+36i not a lot) and use instanced data to differentiate between those. Unless you have some sort of elaborate optimisation that needs individual vertex data? – user369070 Nov 18 '20 at 19:23
  • Sorry, I mean **a lot** of vertex and **maybe** a lot of index data. So if I use instancing, the only thing I need to pass in other than the cube and the indices itself is the position. And so the baked lighting data, textures, all other cube data can be stored in the main vertex buffer? – entropy32 Nov 18 '20 at 19:27
  • Baked lights... In a dynamic environment? Are you sure this is a good idea? If so - then you probably will need separate vertices. Texture coordinates are easier, since on same geometry they are pretty much the same, and to pick different textures you could just offset or layer them with data passed through instance data, since every block of same type has same texture. – user369070 Nov 18 '20 at 19:33
  • Oh, by the way, you could put the copy buffer command in your draw command buffer, then a memory barrier. Each time you record it, you tell it what parts of memory to copy. But beware that this may, and most likely will, cause stutter. Also, `vkCmdBindVertexBuffers` doesn't copy anything - it just tells the next draw commands to use that buffer, so you can switch between the vertex buffers used inside one command buffer – user369070 Nov 18 '20 at 19:41
  • Ok thanks! I am planning to only use baked 4-state lighting because that actually *significantly* improves the appearance; at most I won't go to smooth lighting, but I may use dynamic lighting in the future if needed. I'm just going to try using a CPU-accessible buffer; if it is slow I might try a staging buffer to see if that helps at all. – entropy32 Nov 18 '20 at 19:53