Multi-thread rendering vs command pools

Question

After all, being able to build command buffers in parallel is one of the selling points of Vulkan.

Specs (5.1 Command Pools) (emphasis mine):

Command pools are application-synchronized, meaning that a command pool must not be used concurrently in multiple threads. That includes use via recording commands on any command buffers allocated from the pool, as well as operations that allocate, free, and reset command buffers or the pool itself.

Doesn't this kind of kill the whole purpose of command pools when it comes to recording in parallel? If you intend to record in parallel, then you would be better off having a separate pool for each thread, isn't that right?

I would understand it if you pre-record command buffers, all allocated from the same pool (in one thread), and then execute them in parallel. That has the advantage of amortized resource creation costs as well as parallel execution. However, parallel recording and command pools don't seem to match very well.

I don't personally know why you wouldn't just pre-record everything. So why is building command buffers in parallel so needed? And would you then really have to use one pool per thread?

Love the way you just put Nicol Bolas as the first line - he seems to be able to answer every question about Vulkan that I can think up. Having to use a separate pool per thread does not kill the purpose of command buffers - in what situation does the pool that the command buffer is allocated from restrict the set of commands you can put into the buffer? — Andrew Williamson, Jul 12 '16 at 01:59
@AndrewWilliamson, very bad typo, I meant _the purpose of command **pools**_. — Shahbaz, Jul 12 '16 at 02:30
Each thread may still be creating and destroying command buffers very frequently. The pools are there for that scenario. — Andrew Williamson, Jul 12 '16 at 02:36
@AndrewWilliamson, why would they do that? In which scenario would that be better than reusing command buffers? — Shahbaz, Jul 12 '16 at 19:18
@Shahbaz It might be easier not to care about resetting and reusing command buffers. Allocating new command buffers is cheap as long as the pool does not need to reach down to the host to acquire more memory. Maybe this article helps: https://community.arm.com/groups/arm-mali-graphics/blog/2016/04/19/massively-multi-thread-for-vulkan In that article, multiple pools are used per thread, each assigned to a rendered frame. After a frame is done rendering, the corresponding pool is reset and ready to be re-used. — Manuzor, Oct 29 '16 at 23:31

Nicol Bolas · Accepted Answer · 2016-07-12T02:38:47.663

If you intend to record in parallel, then you would be better off having a separate pool for each thread, isn't that right?

I don't see how having a separate pool per thread "kills the whole purpose of command pools when it comes to recording in parallel". Indeed, it helps it quite a bit, since each thread can manage its own command pool as it sees fit.

Consider the structural difference between, say, a descriptor pool and a command pool. With a descriptor pool, you basically tell it exactly what you will allocate from it. VkDescriptorPoolCreateInfo provides detailed information which allows implementations to allocate up-front exactly how much memory you'll use for each pool. And you cannot allocate more than this from a descriptor pool.

By contrast, VkCommandPoolCreateInfo contains... nothing. Oh, you tell it which queue family the command buffers will be submitted to. You say whether the command buffers will be short-lived (transient) and whether they can be reset individually. (Whether a buffer is primary or secondary is chosen later, when you allocate it from the pool.) But other than that, you say nothing about the contents of the command buffers. You don't even give it information on how many buffers you'll allocate.

Descriptor pools are intended to be fixed: allocated from as needed, but only up to a quantity set at construction time. Command pools are intended to be very dynamic: allocated from as needed for your particular use cases.

Think of it as each pool having its own malloc/free. Since the user is forced to synchronize access to pools and their buffers, no vkCmd* function has to do so itself when it allocates memory. That makes command building faster. That helps threading. When a thread decides to reset its command pool, it doesn't have to lock any mutexes or do any other such stuff.
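
To make the malloc/free analogy concrete, here is a minimal sketch that deliberately uses no Vulkan calls. `ToyPool` and `record_in_parallel` are hypothetical names invented for illustration; the bump allocator only stands in for whatever memory management a real driver performs behind a VkCommandPool, which the specification does not prescribe.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command pool: a bump allocator with no locking.
// It is only safe because exactly one thread ever touches a given instance.
struct ToyPool {
    std::vector<unsigned char> storage;
    std::size_t used = 0;

    explicit ToyPool(std::size_t bytes) : storage(bytes) {}

    // Stands in for memory a vkCmd* call might need while recording.
    void* allocate(std::size_t bytes) {
        if (used + bytes > storage.size()) return nullptr;
        void* p = storage.data() + used;
        used += bytes;
        return p;
    }

    // Stands in for resetting the pool: everything is released at once.
    void reset() { used = 0; }
};

// Each worker records into its own pool. Returns the bytes used in each pool.
std::vector<std::size_t> record_in_parallel(std::size_t thread_count,
                                            std::size_t commands_per_thread) {
    std::vector<ToyPool> pools;
    pools.reserve(thread_count);
    for (std::size_t i = 0; i < thread_count; ++i) pools.emplace_back(1 << 16);

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < thread_count; ++i) {
        workers.emplace_back([&pools, i, commands_per_thread] {
            for (std::size_t c = 0; c < commands_per_thread; ++c)
                pools[i].allocate(16);  // no mutex: pool i belongs to thread i
        });
    }
    for (auto& w : workers) w.join();

    std::vector<std::size_t> used;
    for (const auto& p : pools) used.push_back(p.used);
    return used;
}
```

Calling `record_in_parallel(4, 100)` runs four workers that never contend with one another, because the only state each one mutates is its own pool.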

There's nothing conceptually wrong with having one command pool per thread. Indeed, having two per thread (double-buffering) makes even more sense.
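
A rough sketch of the "two pools per thread" idea, again without real Vulkan calls. `ToyFramePool` and `ThreadPools` are hypothetical; in an actual renderer each slot would presumably hold a VkCommandPool that is recycled with vkResetCommandPool once the GPU has finished the frame that last used it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-frame pool state owned by a single thread.
struct ToyFramePool {
    std::size_t buffers_allocated = 0;
    std::uint64_t last_frame = 0;
};

// One instance per recording thread, with one pool per frame in flight.
template <std::size_t FramesInFlight>
struct ThreadPools {
    std::array<ToyFramePool, FramesInFlight> pools{};

    // Pick this frame's pool and recycle it. The caller is assumed to have
    // already waited (e.g. on a fence) until the GPU finished the frame that
    // last used this slot.
    ToyFramePool& begin_frame(std::uint64_t frame_number) {
        ToyFramePool& pool = pools[frame_number % FramesInFlight];
        pool.buffers_allocated = 0;  // stands in for vkResetCommandPool
        pool.last_frame = frame_number;
        return pool;
    }
};
```

With `ThreadPools<2>`, frame N records into one pool while the command buffers of frame N-1, allocated from the other pool, may still be executing; the first pool is only recycled when frame N+2 begins.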

I don't personally know why you wouldn't just pre-record everything.

Because you're not making a static tech demo.

I guess this comes from lack of experience, but I imagined the parallel-recording would look like "threads 2-N record secondary command buffers, thread 1 calls all of them in one primary command buffer", in which case there is only one command buffer per thread. That was why I said it kills the purpose of command pools, because you are only making a single allocation per pool.

That's certainly a viable form of recording command buffers in parallel. But there are two things you've missed.

While that is certainly one form of parallel recording, it is not the only one. If you're doing deferred rendering, the thread that builds the CB for the lighting passes will be finished with its work much sooner than one of the threads that's responsible for (part of) the geometry pass. So a well-designed multithreaded system will have to apportion out work to threads based on need, not based on some fixed arrangement of stuff. So an individual thread will often end up building multiple command buffers.

And even if that were not the case, you forget about buffering. When it comes time to build the CBs for the next frame, you can't just overwrite the existing ones. After all, they're probably still in the queue doing work. So each thread will need at least two CBs: the one that's currently being executed and the one that's currently being built.

And even if that were not the case, command pools allocate all memory associated with a CB. There's a reason why I analogized them to malloc/free. Even if you only use a single CB with a particular pool, the fact that this CB's allocations (which can happen due to any vkCmd* function) never have to synchronize with another thread is a good thing.

So no, this does not in any way inhibit the ability to use multiple threads to build CBs.
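
The "apportion work based on need" point above can be sketched as follows. This is an illustrative model only: `distribute_jobs` is a hypothetical helper, each claimed job stands for one command buffer recorded from the claiming thread's own pool, and the shared atomic counter is just one possible way to hand out work.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Workers claim render jobs from a shared counter, which is the only shared
// mutable state. Each claimed job yields one toy "command buffer", counted
// per thread. Returns how many buffers each thread ended up building.
std::vector<std::size_t> distribute_jobs(std::size_t thread_count,
                                         std::size_t job_count) {
    std::atomic<std::size_t> next_job{0};
    std::vector<std::size_t> buffers_per_thread(thread_count, 0);

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&next_job, &buffers_per_thread, t, job_count] {
            // fetch_add returns the previous value, so every job index in
            // [0, job_count) is claimed by exactly one thread.
            while (next_job.fetch_add(1) < job_count)
                ++buffers_per_thread[t];  // slot t is written only by thread t
        });
    }
    for (auto& w : workers) w.join();
    return buffers_per_thread;
}
```

How the jobs split between threads varies from run to run, which is the reason a thread cannot know in advance how many command buffers it will need from its pool.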

Typo in my post: _the whole purpose of command **pools**_, sorry about that. — Shahbaz, Jul 12 '16 at 02:25
I guess this comes from lack of experience, but I imagined the parallel-recording would look like "threads 2-N record secondary command buffers, thread 1 calls all of them in one primary command buffer", in which case there is only one command buffer per thread. That was why I said it kills the purpose of command pools, because you are only making a single allocation per pool. — Shahbaz, Jul 12 '16 at 02:29
_"[...] the fact that this CB's allocations [...] never have to synchronize with another thread is a good thing."_ But is that entirely true? At some point the command pool, which is essentially the memory manager for the command buffers it spawned, has to grab more host memory. This means that *at least the host memory allocator* needs to be synchronized, right? Otherwise I don't see how command pools can work independently from another. In the standard case this would be `malloc` I assume, which is synchronized, but you still need to manually sync when providing a custom allocator. Right? — Manuzor, Oct 29 '16 at 19:34
@Manuzor: "*you still need to manually sync when providing a custom allocator. Right?*" ... do you? Allocations scoped to the command pool's object should not need to be synchronized. So if you want, you can create an allocator that minimizes such synchronization, limiting it solely to getting pages of RAM from the OS. — Nicol Bolas, Oct 29 '16 at 21:37
@NicolBolas That's what I meant. At some level, you need to synchronize the memory allocation. I just wanted to clarify (mainly for myself) that using command pools on threads is not sync-free per se but rather minimizes it. Especially compared to manual synchronization of each command buffer operation (ouch!). Thanks for the quick answer, btw! — Manuzor, Oct 29 '16 at 23:23
@Manuzor, you pay the cost of synchronization once when you allocate the pool memory (perhaps automatically done in `malloc` if it uses that, or automatically done by the kernel if they just get pages for example), but the point is you don't need synchronization during actual time-critical execution. — Shahbaz, Oct 19 '17 at 15:24

krOoze · Answer 2 · 2016-07-12T17:32:18.340

If you intend to record in parallel, then you would be better off having a separate pool for each thread, isn't that right?

It is exactly right. That is what your spec quote implies.

I would understand it if you pre-record command buffers, all allocated from the same pool (in one thread), and then execute them in parallel.

Vulkan does one better. You can pre-record command buffers (allocated from per-thread pools) in parallel and then execute them in parallel too (if your workload is conducive to that).

I don't personally know why you wouldn't just pre-record everything. So why is building command buffers in parallel so needed?

Because it's hard (especially as your app grows in complexity). At some point it even becomes counterproductive (when you twist the command buffers to be pre-recordable, e.g. filling them with empty placeholder bindings, 80% of which won't be used).
It is not necessarily "needed"; Vulkan just lets you choose what you deem best for your app (or part of it).

"*It is not necessarily needed, Vulkan just gives you that extra possibility.*" Even better, Vulkan lets you *choose*. You can pre-record some things (post-processing filters, presenting images, etc) while dynamically filling others. — Nicol Bolas, Jul 12 '16 at 17:14
