
I've read several articles on CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but I still have trouble understanding how I should implement a simple render loop.

Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR with a single pair of semaphores, image_available and rendering_finished, as I've done in the example code below.

However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (I don't know if it matters here; the graphics and presentation queues are actually the same) and operations inside a queue should be consumed sequentially ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...

The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a

if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
    throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);

before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...


void render()
{
    std::uint32_t image_index;
    switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
        std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
    {
    case VK_SUBOPTIMAL_KHR:
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkAcquireNextImageKHR");
    }

    static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit_info{};
    submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;

    submit_info.waitSemaphoreCount = 1;
    submit_info.pWaitSemaphores = &m_image_available;
    submit_info.signalSemaphoreCount = 1;
    submit_info.pSignalSemaphores = &m_rendering_finished;

    submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
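    // Note: commandBufferCount/pCommandBuffers are left at their zero defaults here;
    // in a full render loop the command buffer(s) recorded for this frame would be
    // attached to submit_info at this point.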

    if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
        throw std::runtime_error("vkQueueSubmit");

    VkPresentInfoKHR present_info{};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;

    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &m_rendering_finished;

    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swap_chain().handle();
    present_info.pImageIndices = &image_index;

    switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
    {
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
    case VK_SUBOPTIMAL_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkQueuePresentKHR");
    }
}

EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 (modulo k) after each invocation of render(), and let j denote the index of the invocation of render(), starting from j = 0.

Now, assume the swap chain contains three images.

  • If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
  • In the same way, if j = a, then i = a and the (a + 1)th frame in flight is using swap chain image a, for a = 1, 2
  • Now, if j = 3, then i = 3, but since the swap chain only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
  • If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the submission (vkQueueSubmit) from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.

So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible. What I theoretically could imagine is that if we are submitting work to the GPU more quickly than the GPU is able to consume it, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).


2 Answers


If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.

Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.

Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.

Furthermore, a presentation request is submitted to the presentation engine; by waiting on the m_rendering_finished semaphore, it expresses the dependency that presentation must not happen before the submit_info commands have completed.

I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.

Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all previously submitted signal and wait operations on those semaphores have already completed before you use them again.

The specification says the following for vkAcquireNextImageKHR:

If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending

and furthermore, it says under 7.4.2. Semaphore Waiting

the act of waiting for a binary semaphore also unsignals that semaphore.

I.e. indeed, you need to wait on the CPU until you know for sure that the previously submitted operations which use the same m_image_available semaphore (the signal from vkAcquireNextImageKHR and the wait from vkQueueSubmit) have completed.

And yes, you already got it right: you need to use a fence for that, which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is also a problem).
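To illustrate, here is a minimal sketch of that CPU-side wait, assuming a single fence m_in_flight_fence (a placeholder name, not from the question's code) that was created in the signalled state; submit_info and graphics_queue() are as in the question:

// Wait until the previous submission (and thus its waits/signals on the
// semaphores) has completed, then reset the fence so it can be reused.
if (vkWaitForFences(device(), 1, &m_in_flight_fence, VK_TRUE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
    throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &m_in_flight_fence);

// ... vkAcquireNextImageKHR and submit_info setup as in the question ...

// Passing the fence here makes it signal once all commands of this submission
// have completed, i.e. once m_image_available has been waited on (and thereby
// unsignalled) and m_rendering_finished has been signalled.
if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, m_in_flight_fence) != VK_SUCCESS)
    throw std::runtime_error("vkQueueSubmit");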

What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
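As a rough sketch of that "multiplication" (all names and the choice k = 3 are placeholders, not taken from the question's code), the render loop then cycles through per-frame sets of synchronization objects:

static constexpr std::size_t k = 3;               // frames in flight, chosen freely

std::array<VkSemaphore, k> m_image_available;     // one per frame in flight
std::array<VkSemaphore, k> m_rendering_finished;  // one per frame in flight
std::array<VkFence, k>     m_fence;               // created with VK_FENCE_CREATE_SIGNALED_BIT
std::size_t                m_frame = 0;           // current frame-in-flight index

void render()
{
    // CPU-GPU sync: block until frame m_frame's previous submission has finished,
    // so its semaphores (and command buffer) may safely be reused.
    vkWaitForFences(device(), 1, &m_fence[m_frame], VK_TRUE, std::numeric_limits<std::uint64_t>::max());
    vkResetFences(device(), 1, &m_fence[m_frame]);

    // ... acquire with m_image_available[m_frame], submit with m_fence[m_frame],
    //     present waiting on m_rendering_finished[m_frame], as in the question ...

    m_frame = (m_frame + 1) % k;                   // advance to the next set
}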

j00hi
  • Is there a reason why I shouldn't always "multiply" the semaphores and the fence by the count of swap chain images? – 0xbadf00d Nov 29 '20 at 12:10
  • They do not have to be 1:1. If you have more _frames in flight_ than swapchain images, and those frames in flight render to some intermediate buffers, the GPU can parallelize more work and present as soon as a swapchain image becomes available. Put the other way round: if you have more swapchain images than frames in flight, the presentation engine can work on presenting the images while you already render into an image that is currently not used by the presentation engine (and maybe you have found out that you do not need more than, e.g., two frames in flight to fully utilize your GPU). – j00hi Nov 29 '20 at 12:34
  • Please take note of my edit. Would be great if you could say if I got right and elaborate on the few aspects which I wrote to be still unsure about. – 0xbadf00d Nov 30 '20 at 11:57
  • What your edit suggests would only work if `vkAcquireNextImageKHR` always returned image indices strictly in the order `0` -> `1` -> `2` -> `0` -> `1` -> `2` -> etc. That cannot be guaranteed. For this reason, you also need to make sure to handle cases where `vkAcquireNextImageKHR` does not return image indices in that order. [Vulkan Tutorial](https://vulkan-tutorial.com/Drawing_a_triangle/Drawing/Rendering_and_presentation#page_Conclusion) elaborates on such cases => search for `imagesInFlight`! – j00hi Nov 30 '20 at 15:14
  • A reason why you would not like to have `k` as big as possible is that you do not always want your GPU to run ahead so much. Imagine that you are packing input information into your command submission. Now imagine that you have already submitted so much work to the GPU that it can produce so many frames that the next whole second can be rendered with the data already submitted => your input latency is now 1 second, and that would result in an unplayable game. Furthermore, usually two or three frames in flight suffice to fully utilize your GPU --- this is at least true for bigger applications. – j00hi Nov 30 '20 at 15:18
  • Don't we need more generally to ensure that a framebuffer (which consists in the example in the question of a single swap chain image) which is used from a command buffer which is submitted to a queue is not used again in a subsequent submit before the previous one has finished working (assuming, of course, that the images inside the framebuffer are not used as an input resource, but are overwritten by the command buffers)? – 0xbadf00d Dec 03 '20 at 14:52
  • The images you render into are strictly tied to a framebuffer. I.e. whenever I wrote about frames in flight which _concurrently_ render into multiple different _images in flight_, that implies that there are the same number of _framebuffers in flight_, so to speak. I.e. you would set up as many framebuffers as there are images to be rendered into concurrently. (At least for graphics pipelines, where you need framebuffers.) The fences ensure that subsequent submits do not render into the same framebuffer/image before the previous submit has finished, by waiting on the CPU. – j00hi Dec 04 '20 at 09:13
  • "render into the same framebuffer/image" simply means that we've got attachements to which at least one of the involved shaders write, right? Those attachements need the "special protection" using fences. But all other attachements which are only read by shaders don't need something like that. And the same should apply to compute pipelines or am I missing something? – 0xbadf00d Dec 05 '20 at 14:23
  • Yes, that's about right. Synchronization is a very broad topic in Vulkan. I can recommend two sources if you'd like to make yourself more familiar with it: [Yet another blog explaining Vulkan synchronization](http://themaister.net/blog/2019/08/14/yet-another-blog-explaining-vulkan-synchronization/) and the [Introduction to Vulkan](https://youtu.be/isbMMIwmZes) lecture from [22:30](https://youtu.be/isbMMIwmZes?t=1350) onwards. – j00hi Dec 05 '20 at 16:03
  • Besides the attachments, we also need to ensure that we are not reusing a command pool for which there is at least one command buffer currently being consumed by the GPU (maybe, I'm not sure about this, we only need to make sure that any such command buffer is not being reused). So, it seems like I also need a fence per command pool (or per command buffer "batch"). But since we can only specify a single fence in vkQueueSubmit and here we are already using the fence from the "frames in flight" struct, I have no idea how we need to do that. Am I missing something? – 0xbadf00d Dec 11 '20 at 05:54
  • There's no problem with re-using command pools. In fact, they should be re-used. Don't always create a new one! Only the command buffers themselves must have completed execution before they can be re-recorded (unless they have been created with the `VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT`). Command pools can be thought of as CPU-only constructs. There are no instructions sent to the GPU which would need to be synchronized. (See the sketch after these comments.) – j00hi Dec 11 '20 at 06:56
  • I guess I'm confused with the way "command allocators" are used in D3D12. There you might have an array of command allocators and reset a "command list" to the next allocator after submission. But it seems to me that in Vulkan we would only use a single command pool (maybe per thread) and have an array of command buffers instead, right? It's still not clear to me how I know whether command buffer execution has completed, since the single fence I can specify in vkQueueSubmit is already used from the frames in flight struct. – 0xbadf00d Dec 11 '20 at 07:15
  • Please take a look at my follow-up question: https://stackoverflow.com/q/65293272/547231. – 0xbadf00d Dec 14 '20 at 17:05
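A minimal sketch of the command-buffer reuse discussed in the comments above, assuming one command buffer per frame in flight, all allocated once from a single pool created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT (the names m_cmd and record_frame are placeholders):

std::array<VkCommandBuffer, k> m_cmd;   // k = number of frames in flight, allocated once

void record_frame(std::size_t frame, std::uint32_t image_index)
{
    // Safe to reset and re-record: render() has already waited on m_fence[frame],
    // and that fence was passed to the vkQueueSubmit that last used m_cmd[frame].
    vkResetCommandBuffer(m_cmd[frame], 0);

    VkCommandBufferBeginInfo begin_info{};
    begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    vkBeginCommandBuffer(m_cmd[frame], &begin_info);
    // ... record the draw commands targeting the framebuffer for image_index ...
    vkEndCommandBuffer(m_cmd[frame]);
}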

So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. Here they are used to synchronize the rendering commands with the present command, so that the presentation engine knows when to present the rendered image.

Fences are the main utility for CPU-GPU synchronization. You pass a fence to a queue submit and then, on the CPU side, wait for it before you proceed. Here this is usually done so that we do not queue any new rendering/present commands while the previous frame hasn't finished.

But does that mean that I necessarily need to do a

if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
    throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);

before my call to vkAcquireNextImageKHR?

Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.

In general, if you want your CPU to wait until your GPU has finished rendering the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command on the queue or device. However, in practice you do not want to stall the CPU; instead you want to record commands for the next frame in the meantime. This is done via frames in flight. It simply means that for every frame in flight (i.e. every frame that can be recorded in parallel to execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.

So, in essence, for your render loop to work properly you need a pair of semaphores plus a fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and the image index (swapchain) will generally not be the same, even if you use the same number of swapchain images as frames in flight. This is because the presentation engine might give you swapchain images out of order, depending on your present mode.
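As a sketch of how that mismatch is commonly handled (following the imagesInFlight idea from the Vulkan Tutorial linked above; all names below, including the per-frame fence m_fence[m_frame], are placeholders), one additionally remembers which frame's fence last rendered into each swap chain image:

// One entry per swap chain image, all initialized to VK_NULL_HANDLE.
std::vector<VkFence> m_image_in_flight;

// Right after vkAcquireNextImageKHR returned image_index for frame m_frame:
if (m_image_in_flight[image_index] != VK_NULL_HANDLE)
{
    // A previous frame in flight is still rendering into this swap chain image.
    vkWaitForFences(device(), 1, &m_image_in_flight[image_index], VK_TRUE,
        std::numeric_limits<std::uint64_t>::max());
}
// From now on, this image is guarded by the current frame's fence.
m_image_in_flight[image_index] = m_fence[m_frame];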

Firnor
  • Is there a reason why I shouldn't make the number of frames in flight and swap chain images equal? To me, it seems like a fence is "protecting" a single image, so there seems to be a 1:1 relationship. – 0xbadf00d Nov 29 '20 at 12:12
  • Of course you could, but it really depends on your application. For example, increasing the number of frames in flight increases the potential to utilize the GPU better (since enough work is provided) but also increases input lag. For swap chain images, in a regular application you have 2 or 3, and only on very rare occasions do you have more (I can't even think of one). – Firnor Nov 29 '20 at 12:17
  • Regarding your thought that a fence protects a single image: That is true but it only needs protection because multiple command submissions want to use the same resource. If you only have one frame in flight, i.e. full sync after submission, then you can have multiple images but still use only a single fence. – Firnor Nov 29 '20 at 12:25
  • Please take note of my edit. Would be great if you could say if I got right and elaborate on the few aspects which I wrote to be still unsure about. – 0xbadf00d Nov 30 '20 at 11:57
  • I'd like to refer to the latest comment of J00hi in the answer above: in summary, swapchain image indices need not be in order, and thus there is no 1:1 mapping from frame-in-flight index to swapchain image index. A higher number of frames in flight corresponds to higher input latency. – Firnor Dec 01 '20 at 12:25
  • Please take a look at my follow-up question: https://stackoverflow.com/q/65293272/547231. – 0xbadf00d Dec 14 '20 at 17:05