
I am trying to set up my renderer so that rendering always goes into a texture, and I then present whichever texture I like as long as its format is swapchain compatible. This means I need to deal with one graphics queue (I don't have compute yet) that renders the scene, UI, etc.; one transfer queue that copies the rendered image into the swapchain; and one present queue for presenting the swapchain. This is the use case I am trying to tackle at the moment, but I will have more like it (e.g. compute queues) as my renderer matures.

Here is pseudocode for what I am trying to achieve. I added some of my own assumptions as well:

// wait for fences per frame
waitForFences(fences[currentFrame]);
resetFences(fences[currentFrame]);

// 1. Rendering (queue = Graphics)
commandBuffer.begin();
renderEverything();
commandBuffer.end();

QueueSubmitInfo renderSubmit{};
renderSubmit.commandBuffer = commandBuffer;

// Nothing to wait for
renderSubmit.waitSemaphores = nullptr;

// Signal that rendering is complete
renderSubmit.signalSemaphores = { renderSemaphores[currentFrame] };

// Do not signal the fence yet
queueSubmit(renderSubmit, nullptr);

// 2. Transferring to swapchain (queue = Transfer)

// acquire the image that we want to copy into
// and signal that it is available
swapchain.acquireNextImage(imageAvailableSemaphore[currentFrame]);

commandBuffer.begin();
copyTexture(textureToPresent, swapchain.getAvailableImage());
commandBuffer.end();

QueueSubmitInfo transferSubmit{};
transferSubmit.commandBuffer = commandBuffer;

// Wait for swapchain image to be available
// and rendering to be complete
transferSubmit.waitSemaphores = { renderSemaphores[currentFrame], imageAvailableSemaphore[currentFrame] };

// Signal another semaphore that swapchain
// is ready to be used
transferSubmit.signalSemaphores = { readyForPresenting[currentFrame] };

// Now, signal the fence since this is the end of frame
queueSubmit(transferSubmit, fences[currentFrame]);

// 3. Presenting (queue = Present)
PresentQueueSubmitInfo presentSubmit{};

// Wait until the swapchain is ready to be presented
// Basically, waits until the image is copied to swapchain
presentSubmit.waitSemaphores = { readyForPresenting[currentFrame] };

presentQueueSubmit(presentSubmit);

My understanding is that fences are needed to make the CPU wait until the GPU has finished executing the previously submitted command buffers.

When dealing with multiple queues, is it enough to make the CPU wait on a single per-frame fence and synchronize the different queues with semaphores (the pseudocode above is based on this)? Or should each queue's submission be covered by a separate fence?

To get into technical details, what will happen if two command buffers are submitted to the same queue without any semaphores? Pseudocode:

// first submissions
commandBufferOne.begin();
doSomething();
commandBufferOne.end();

SubmitInfo firstSubmit{};
firstSubmit.commandBuffer = commandBufferOne;
queueSubmit(firstSubmit, nullptr);

// second submission
commandBufferTwo.begin();
doSomethingElse();
commandBufferTwo.end();

SubmitInfo secondSubmit{};
secondSubmit.commandBuffer = commandBufferTwo;
queueSubmit(secondSubmit, nullptr);

Will the second submission overwrite the first one, or will the queue behave like a FIFO and execute the first submission before the second one since it was submitted first?

Gasim
  • What happens if the GPU has only one queue? Or the presentation engine doesn't support copies into swapchain images? Or there is no queue that can present and cannot perform graphics? – Nicol Bolas May 15 '22 at 13:25
  • I am currently using only one queue anyway, since on my GPU one queue can do graphics, transfer, and presentation; however, I am not sure what to expect from the wide variety of hardware out there, considering that the spec does not say anything about how the queues should be defined. – Gasim May 15 '22 at 14:12
  • The spec says that all graphics queues can do transfer (and compute) operations. And while GPUs can control which queue families can do presentation, that's not really an issue since presentation doesn't offer a fence to sync with. You just have to make sure that the present is done after submitting the graphics operation. – Nicol Bolas May 15 '22 at 14:15
  • I am going to quote the lines from the spec here for future reference (I completely missed the first one): "If an implementation exposes any queue family that supports graphics operations, at least one queue family of at least one physical device exposed by the implementation must support both graphics and compute operations." and "All commands that are allowed on a queue that supports transfer operations are also allowed on a queue that supports either graphics or compute operations." – Gasim May 15 '22 at 14:48
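
A minimal sketch of how those guarantees can be used in practice, assuming a VkPhysicalDevice is already available (the pickGraphicsQueueFamily helper and the surface check are assumptions, not code from the question): any family with VK_QUEUE_GRAPHICS_BIT set also supports transfer, so one family can cover both the rendering and the copy into the swapchain image.

#include <vulkan/vulkan.h>
#include <optional>
#include <vector>

// Hypothetical helper: find one queue family that supports graphics.
// Per the spec quotes above, such a family also supports transfer operations,
// so it can record both the render pass and the copy into the swapchain image.
std::optional<uint32_t> pickGraphicsQueueFamily(VkPhysicalDevice physicalDevice)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        if (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) {
            // Presentation support is a separate query, e.g.
            // vkGetPhysicalDeviceSurfaceSupportKHR(physicalDevice, i, surface, &supported).
            return i;
        }
    }
    return std::nullopt;
}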

1 Answer


This entire organizational scheme seems dubious.

Even ignoring the fact that the Vulkan specification does not require GPUs to offer separate queues for all of these things, you're spreading a series of operations across asynchronous execution, despite the fact that these operations are inherently sequential. You cannot copy from an image to the swapchain until the image has been rendered, and you cannot present the swapchain image until the copy has completed.

So there is basically no advantage to putting these things into their own queues. Just do all of them on the same queue (with one submit and one vkQueuePresentKHR), using appropriate execution and memory dependencies between the operations. This means there's only one thing to wait on: the single submission.
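
As a rough, single-queue sketch of that layout (the image layouts, barrier parameters, and helper names here are assumptions, not your renderer's API; render pass recording and the swapchain-image barriers are elided):

#include <vulkan/vulkan.h>

// Sketch only: the queue, swapchain, command buffer, and sync objects are
// assumed to have been created elsewhere; error handling is omitted.
void drawFrame(VkDevice device, VkQueue queue, VkSwapchainKHR swapchain,
               VkCommandBuffer cmd, VkImage renderedImage,
               VkSemaphore imageAvailable, VkSemaphore renderFinished,
               VkFence frameFence)
{
    // One fence per frame; it is signaled by the single submission below.
    vkWaitForFences(device, 1, &frameFence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frameFence);

    uint32_t imageIndex = 0;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          imageAvailable, VK_NULL_HANDLE, &imageIndex);

    VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
    vkBeginCommandBuffer(cmd, &begin);

    // 1. Render the scene and UI into renderedImage (render pass recording omitted).

    // 2. Execution + memory dependency: the copy must see the finished render.
    VkImageMemoryBarrier toTransferSrc{VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER};
    toTransferSrc.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    toTransferSrc.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
    toTransferSrc.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    toTransferSrc.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
    toTransferSrc.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    toTransferSrc.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    toTransferSrc.image = renderedImage;
    toTransferSrc.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                         0, nullptr, 0, nullptr, 1, &toTransferSrc);

    // 3. Copy renderedImage into the acquired swapchain image (vkCmdCopyImage
    //    omitted), then transition that image to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
    //    with another barrier before ending the command buffer.

    vkEndCommandBuffer(cmd);

    // One submit for the whole frame: wait for the acquired swapchain image,
    // signal renderFinished for the present, signal the frame fence for the CPU.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
    VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submit.waitSemaphoreCount = 1;
    submit.pWaitSemaphores = &imageAvailable;
    submit.pWaitDstStageMask = &waitStage;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores = &renderFinished;
    vkQueueSubmit(queue, 1, &submit, frameFence);

    VkPresentInfoKHR present{VK_STRUCTURE_TYPE_PRESENT_INFO_KHR};
    present.waitSemaphoreCount = 1;
    present.pWaitSemaphores = &renderFinished;
    present.swapchainCount = 1;
    present.pSwapchains = &swapchain;
    present.pImageIndices = &imageIndex;
    vkQueuePresentKHR(queue, &present);
}

The fence is attached only to that one submission, which is exactly the "one thing to wait on" described above.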

Plus, submit operations are really expensive; doing two submits instead of one submit containing both pieces of work is only a good thing if the submissions are being done on different CPU threads that can work concurrently. But binary semaphores stop that from working. You cannot submit a batch that waits for semaphore A until you have submitted the batch that signals semaphore A. This means that the signaling batch must either be earlier in the same submit command or must have been submitted in a prior submit command. Which means that if you put those submits on different threads, you have to use a mutex or something to ensure that the signaling submit happens-before the waiting submit.[1]

So you don't get any asynchronous execution of the queue submit operations; neither the CPU nor the GPU will execute any of this asynchronously.

[1]: Timeline semaphores don't have this problem.
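
For example, with a timeline semaphore (core in Vulkan 1.2, or VK_KHR_timeline_semaphore), a batch that waits on a value may legally be submitted before the batch that will signal that value. A minimal sketch, assuming the timelineSemaphore feature was enabled at device creation:

#include <vulkan/vulkan.h>

// Sketch: create a timeline semaphore, then submit a batch that waits on
// value 1 before any batch signaling value 1 has been submitted.
VkSemaphore createTimelineSemaphore(VkDevice device)
{
    VkSemaphoreTypeCreateInfo typeInfo{VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO};
    typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
    typeInfo.initialValue = 0;

    VkSemaphoreCreateInfo createInfo{VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO};
    createInfo.pNext = &typeInfo;

    VkSemaphore semaphore = VK_NULL_HANDLE;
    vkCreateSemaphore(device, &createInfo, nullptr, &semaphore);
    return semaphore;
}

void submitWaitingBatch(VkQueue queue, VkCommandBuffer cmd, VkSemaphore timeline)
{
    uint64_t waitValue = 1; // will be signaled by a later submission

    VkTimelineSemaphoreSubmitInfo timelineInfo{VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO};
    timelineInfo.waitSemaphoreValueCount = 1;
    timelineInfo.pWaitSemaphoreValues = &waitValue;

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
    VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submit.pNext = &timelineInfo;
    submit.waitSemaphoreCount = 1;
    submit.pWaitSemaphores = &timeline;
    submit.pWaitDstStageMask = &waitStage;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

    // The batch that signals the timeline to 1 can be submitted later; the
    // application no longer has to order the signaling submit before this one.
}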


As for the particulars of your technical question: if operation A is dependent on operation B, and you synchronize with A, you have also synchronized with B. Since your transfer operation waits on a signal from the graphics queue, waiting on the transfer operation will also wait on the graphics commands from before that signal.

Nicol Bolas
  • I understand what you mean, and currently I have one queue with one submission that submits everything at once and presents it afterwards. However, I don't always want to present everything I render. My current system with one render graph, one queue submission, and a present makes it really complex and cumbersome to do any kind of "one-time" render operations, which I need to utilize a lot. This is why I am trying to separate rendering completely from presenting. – Gasim May 15 '22 at 14:30
  • The number of queue submit operations is more important than whatever else it is you're prioritizing. Just figure out if you need to create a CB to copy the result image to the presentable one. If you do, add it to the submit operation. – Nicol Bolas May 15 '22 at 15:03
  • Why is queue submission expensive? We are talking about two queue submissions instead of one, and the second submission consists of 3-4 commands (barriers + a copy command). When the commands are being recorded, there is no CPU intervention between these commands; so, from my understanding, I don't even need a fence here. What contributes to the cost of a queue submission? – Gasim May 15 '22 at 16:29
  • @Gasim: "*Why is queue submission expensive?*" Does it really matter? The Vulkan spec stops in the middle of the documentation for `vkQueueSubmit` to specifically deliver a warning about its performance and advises you to use the function as little as possible. The function itself facilitates this by being able to take multiple command buffers and multiple *batches* of command buffers. "Why" in this context is philosophical, since there's nothing *you* can do about it either way. – Nicol Bolas May 15 '22 at 16:48
  • I was mainly asking just to understand whether it depends on command buffer size or something else. Nvidia's Vulkan Do's and Don'ts says to aim for 5-10 queue submissions per frame while also saying to minimize them as much as possible. I understand that multiple queue submissions might not be optimal, but if I am only going to lose `<1ms` of frame time on this, the simplicity and flexibility it brings might be worth the tradeoff. I am going to profile one vs. two submissions and make a decision based on the performance impact. – Gasim May 15 '22 at 17:22
  • @Gasim: What is the "simplicity" here? At some point, you make the decision whether to copy to a swapchain image or not. Just put that decision *before* your submission instead of after. I fail to see how either is "simpler". Same goes for "flexibility"; how is one more "flexible"? – Nicol Bolas May 15 '22 at 17:30
  • Now that I think about it, I think you are right! Each render operation (call to `render`) and present operation will have different command buffers per frame. So, I can technically batch them together and send them all at once at the end of the frame. – Gasim May 15 '22 at 18:12
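
To make that batching concrete, here is a minimal sketch (the frameCommandBuffers container, the submitFrame helper, and the semaphore names are assumptions) of handing all of a frame's command buffers to the driver in a single vkQueueSubmit:

#include <vulkan/vulkan.h>
#include <vector>

// Sketch: collect every command buffer recorded for this frame (scene passes,
// UI, the copy into the swapchain image, ...) and submit them in one batch,
// signaling the per-frame fence once.
void submitFrame(VkQueue queue,
                 const std::vector<VkCommandBuffer>& frameCommandBuffers,
                 VkSemaphore imageAvailable, VkSemaphore readyForPresenting,
                 VkFence frameFence)
{
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;

    VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submit.waitSemaphoreCount = 1;
    submit.pWaitSemaphores = &imageAvailable;
    submit.pWaitDstStageMask = &waitStage;
    submit.commandBufferCount = static_cast<uint32_t>(frameCommandBuffers.size());
    submit.pCommandBuffers = frameCommandBuffers.data();
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores = &readyForPresenting;

    // Command buffers in a batch begin execution in the order they appear,
    // but they may overlap and complete out of order unless barriers, events,
    // or semaphores say otherwise.
    vkQueueSubmit(queue, 1, &submit, frameFence);
}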