
Vulkan is intended to be thin and explicit to the user, but queues are a big exception to this rule: queues may be multiplexed by the driver, and it's not always obvious whether using multiple queues from a family will improve performance or not.

After one of the driver updates, I got 2 transfer-only queues instead of one, but I'm pretty sure there will be no benefit in using them in parallel for data streaming compared to just using one of them (I will be happy to be proven wrong).

So why not just say "we have N separate hardware queues, and if you want to use some of them in parallel, just mutex it yourself"? As it stands, there seems to be no way to know how independent the queues in a family really are.
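For context, this is roughly how I'm looking at the families (a minimal sketch in C; physical-device selection and error handling are omitted, and the fixed-size array is just for brevity):

    #include <stdio.h>
    #include <vulkan/vulkan.h>

    /* Sketch: list each queue family and how many queues it exposes.
       Assumes `physicalDevice` was already selected elsewhere. */
    void listQueueFamilies(VkPhysicalDevice physicalDevice)
    {
        uint32_t familyCount = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, NULL);

        VkQueueFamilyProperties families[16];
        if (familyCount > 16)
            familyCount = 16;
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, families);

        for (uint32_t i = 0; i < familyCount; ++i) {
            VkQueueFlags flags = families[i].queueFlags;
            int transferOnly = (flags & VK_QUEUE_TRANSFER_BIT) &&
                               !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT));
            printf("family %u: queueCount=%u, transfer-only=%s\n",
                   i, families[i].queueCount, transferOnly ? "yes" : "no");
        }
    }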

YaaZ
  • A somewhat related issue is open at the spec repo: https://github.com/KhronosGroup/Vulkan-Docs/issues/569 – krOoze Nov 08 '19 at 16:56

1 Answer


GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.

That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.

And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:

I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them

Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that only one thing can be DMA'd from CPU memory to GPU memory at any one time.

Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.

With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
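For illustration only, here is a minimal sketch of how two queues from one transfer-capable family might be requested with different priorities at device creation. The family index, the 1.0/0.5 priority values, and the download/upload split are assumptions for this sketch; Vulkan queue priorities are only relative hints in [0.0, 1.0], so none of this guarantees actual preemption:

    #include <vulkan/vulkan.h>

    /* Sketch: create a device with two queues from one transfer-capable family,
       giving the "download" queue a higher priority than the "upload" queue.
       `physicalDevice` and `transferFamilyIndex` are assumed to have been found
       earlier; priorities are hints, not a preemption guarantee. */
    VkDevice createDeviceWithTwoTransferQueues(VkPhysicalDevice physicalDevice,
                                               uint32_t transferFamilyIndex,
                                               VkQueue *downloadQueue,
                                               VkQueue *uploadQueue)
    {
        float priorities[2] = { 1.0f, 0.5f };   /* [0] downloads, [1] uploads */

        VkDeviceQueueCreateInfo queueInfo = {
            .sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
            .queueFamilyIndex = transferFamilyIndex,
            .queueCount       = 2,   /* must not exceed the family's queueCount */
            .pQueuePriorities = priorities,
        };

        VkDeviceCreateInfo deviceInfo = {
            .sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
            .queueCreateInfoCount = 1,
            .pQueueCreateInfos    = &queueInfo,
        };

        VkDevice device = VK_NULL_HANDLE;
        if (vkCreateDevice(physicalDevice, &deviceInfo, NULL, &device) != VK_SUCCESS)
            return VK_NULL_HANDLE;

        vkGetDeviceQueue(device, transferFamilyIndex, 0, downloadQueue);
        vkGetDeviceQueue(device, transferFamilyIndex, 1, uploadQueue);
        return device;
    }

Whether the higher priority actually lets the download interrupt an in-flight upload is entirely up to the implementation, which is exactly the point: it's a hardware-level feature you can't replicate yourself.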

With your way, you'd have to upload each item one at a time, at regular intervals, in a process that must be able to be interrupted by a possible download. To do that, you'd basically have to have a recurring task that shows up to record and submit a single upload to the transfer queue.
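Sketched out, that manual interleaving might look roughly like the following, where haveMoreChunks, downloadRequested, recordUploadChunk, and recordDownload are hypothetical helpers standing in for the real command-buffer recording and bookkeeping:

    #include <stdbool.h>
    #include <vulkan/vulkan.h>

    /* Hypothetical helpers standing in for real command recording/bookkeeping. */
    bool haveMoreChunks(void);
    bool downloadRequested(void);
    VkCommandBuffer recordUploadChunk(void);
    VkCommandBuffer recordDownload(void);

    /* Sketch of manual interleaving on a single transfer queue: submit the
       upload in small chunks so a pending download can be slotted in between
       submissions. */
    void streamWithManualInterleaving(VkQueue transferQueue, VkFence downloadFence)
    {
        while (haveMoreChunks()) {
            if (downloadRequested()) {
                VkCommandBuffer dl = recordDownload();
                VkSubmitInfo dlSubmit = {
                    .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                    .commandBufferCount = 1,
                    .pCommandBuffers    = &dl,
                };
                /* Service the download as soon as possible. */
                vkQueueSubmit(transferQueue, 1, &dlSubmit, downloadFence);
            }

            VkCommandBuffer up = recordUploadChunk();  /* one small upload */
            VkSubmitInfo upSubmit = {
                .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                .commandBufferCount = 1,
                .pCommandBuffers    = &up,
            };
            vkQueueSubmit(transferQueue, 1, &upSubmit, VK_NULL_HANDLE);
        }
    }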

It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, it'll probably perform operations round-robin, alternating between the operations from the two transfer queues rather than waiting for one queue to run dry before trying the other.

But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.

The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware, for example, offers two transfer queues, and these actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.

Nicol Bolas
  • But if the GPU has to deal with parallel command streams anyway, why is there a limit on the maximum number of queues that can be created from a queue family? For example, when I have a queue family that supports creating 16 queues, I can create 2 VkDevices, each with 16 queues. So the limit doesn't mean the GPU can handle at most 16 queues, only that it handles no more than 16 per logical device – YaaZ Nov 08 '19 at 15:26
  • @YaaZ: But what if there are only 16 actual dispatch units for work? That means that those devices will actually be talking to the same dispatch units, both submitting work for them. And whatever trickery is happening within the logical device to make that work, it's still better than encouraging users of a *single* logical device to create more queues than the actual hardware can support. That is, you might have to share between devices, but you shouldn't have sharing within a device. – Nicol Bolas Nov 08 '19 at 15:36
  • But that's exactly what's going on with Nvidia GPUs and their 16 graphics queues mapping to a single graphics frontend! Aren't they encouraging users to create more queues than the hardware can support? – YaaZ Nov 08 '19 at 16:00
  • @YaaZ: Are you talking about the underlying execution resources or the actual command dispatching hardware? Because the whole point of this conversation is that these aren't the same thing. – Nicol Bolas Nov 08 '19 at 16:07
  • I have never thought of these as different things; this clarifies a lot. So do I understand correctly that distinct VkQueues often map to separate hardware dispatchers, which often share underlying execution resources? Thanks! – YaaZ Nov 08 '19 at 16:48
  • @YaaZ: You really should read the page I linked to. It goes into detail about when multi-queue operation is useful and how queues typically map to the hardware. – Nicol Bolas Nov 08 '19 at 16:52