Why is a single depth buffer sufficient for this Vulkan swapchain render loop?

Question

I was following the Vulkan tutorial at https://vulkan-tutorial.com/ and in the depth buffering chapter, the author Alexander Overvoorde mentions that "We only need a single depth image, because only one draw operation is running at once." This is where my issue comes in.

I've read many SO questions and articles/blog posts on Vulkan synchronization in the past days, but I can't seem to reach a conclusion. The information that I've gathered so far is the following:

Draw calls in the same subpass execute on the GPU as if they were in order, but only if they draw to the framebuffer (I can't recall exactly where I read this, it might have been a tech talk on YouTube, so I am not 100% sure about this). As far as I understood, this is more GPU hardware behavior than it is Vulkan behavior, so this would essentially mean that the above is true in general (including across subpasses and even render passes), which would answer my question, but I can't find any clear information on this.

The closest I've gotten to getting my question answered is this reddit comment that the OP seemed to accept, but the justification is based on 2 things:

"there is a queue flush at the high level that ensures previously submitted render passes are finished"
"the render passes themselves describe what attachments they read from and write to as external dependencies"

I see neither any high level queue flush (unless there is some sort of explicit one that I cannot find for the life of me in the specification), nor where the render pass describes dependencies on its attachments - it describes the attachments, but not the dependencies (at least not explicitly). I have read the relevant chapters of the specification multiple times, but I feel like the language is not clear enough for a beginner to fully grasp.

I would also really appreciate Vulkan specification quotes where possible.

Edit: to clarify, the final question is: What synchronization mechanism guarantees that the draw call in the next command buffer is not submitted until the current draw call is finished?

This is not part of the question, but I'd also appreciate pointers on articles or preferably books on this sort of GPU behavior that's relevant to programming. I've ordered both Learning Vulkan by Parminder Singh and Vulkan Programming Guide by Graham Sellers, but they haven't arrived yet and don't seem to include too much on GPU hardware anyway (I might be wrong though). I unfortunately don't like the format of the Vulkan Cookbook by Pawel Lapinski that I read is one of the better options - I much prefer properly grasping the theory and doing things myself than following a "recipe" — cluntraru, Jun 14 '20 at 10:30
Does this answer your question? [Synchronization between drawcalls in Vulkan](https://stackoverflow.com/questions/56849788/synchronization-between-drawcalls-in-vulkan) — krOoze, Jun 14 '20 at 11:13
Not exactly. It explains that draws in the same subpass are executed as if in order, but what about different subpasses (or almost identical command buffers submitted twice in a row in this case)? It mentions that those are governed by external dependencies, but the only external subpass dependency here would be the one used to synchronize with the imageAcquired semaphore that has no srcAccessMask and, as a result, should not actually wait on any color attachment stages since there is no resource it needs to access (correct me if I'm wrong). — cluntraru, Jun 14 '20 at 11:27
Ran out of characters... in this case, wouldn't the following drawFrame() call run with no synchronization between the 2 and cause something like: render 1 starts, then render 2 starts, then render 1 finishes, then present 1 starts, then render 2 finishes etc.? In this case, the depth buffer would be reused and presumably broken. — cluntraru, Jun 14 '20 at 11:30
@cluntraru: To answer your question would require analyzing the dependency graph of the tutorial code in question, which is a pretty big thing to ask. For all I know, there are explicit events or semaphore waits that prevent overlap between frames. I don't know if that's true, but I also don't know it isn't true. And the only way to find out would be to read the entire code of the tutorial. — Nicol Bolas, Jun 14 '20 at 14:00
@NicolBolas I was under the impression that it is a generally accepted thing that having a single depth attachment is enough for a classic acquire - render - present loop. For example, [Sascha Willems'](https://github.com/SaschaWillems/Vulkan/blob/master/examples/triangle/triangle.cpp#L333) example seems to apply the same principle, at least to my untrained eye. The situation also seems to be similar, the semaphores being command buffer specific. — cluntraru, Jun 14 '20 at 14:26
@NicolBolas I understand if someone else's code is a hassle to look through (which it is, Vulkan is super verbose). If either of you have an example of a correct depth attachment implementation with a basic acquire-render-present loop, I'd be more than happy to discuss that case. The only thing that's needed to make an example relevant is the swapchain having at least 2 potential images to return on acquire. — cluntraru, Jun 14 '20 at 14:48
Vulkan is an explicit API. You almost always need to explicitly synchronize in Vulkan. You can count the exceptions to that on one hand, if you accidentally put your hand in a running lawn mower. As per the Q I linked, in a single subpass, there is [Rasterization Order](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#fragops), because syncing the individual `vkDraw*`s would get ridiculous. Everything else must be synced explicitly. Your app sample either does the explicit sync correctly, or is nonconformant. Either way that does not feel on topic to the question as asked. — krOoze, Jun 15 '20 at 02:22
@krOoze I don't see how it's not on topic, my question essentially boils down to "What synchronization mechanism guarantees that the next frame does not start drawing before the one that was just submitted has finished?" If discussing on the code samples that I provided is not an option, that's perfectly fine. How would you implement an acquire-draw-present loop that uses a single depth image with multiple swapchain images? Specifically, how would you make sure that rendering the next frame would not begin before rendering of the current frame ends? — cluntraru, Jun 15 '20 at 08:38
@cluntraru Except I do not see any of this in the Q. You asked whether the implicit sync guarantees apply across subpasses and render passes, and the answer is "no", and you said this would answer your Q. Whether some app got its explicit sync scheme right, or wrong (which might even work correctly in reality due to undefined behavior), simply seems beside the point to me. Also TBH, this is a tutorial; the author should be explaining his code and any controversial statements, not us. I believe there is a Discord forum for the purpose, as well as Issue tickets on GitHub. — krOoze, Jun 15 '20 at 13:43
@krOoze I feel like debating the contents and motivations of the question would go nowhere at this point. The author is no longer active in the comments, which is why I posted here. By "this would answer my question" I meant if draw commands synchronized to other commands outside the subpass, which turned out not to be true. Some of your comments did prove to be enlightening, so thank you. — cluntraru, Jun 15 '20 at 14:01
I am leaving the question open in case someone familiar with the tutorial comes across it and has a specific answer. — cluntraru, Jun 15 '20 at 14:11

j00hi · Accepted Answer · 2020-06-16T09:58:08.680

12

I'm afraid I have to say that the Vulkan Tutorial is wrong. In its current state, it cannot be guaranteed that there are no memory hazards when using only a single depth buffer. However, only a very small change is required to make a single depth buffer sufficient.

Let's analyze the relevant steps of the code that are performed within drawFrame.

We have two different queues: presentQueue and graphicsQueue, and MAX_FRAMES_IN_FLIGHT concurrent frames. I refer to the "in flight index" with cf (which stands for currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT). I am using sem1 and sem2 to represent the different arrays of semaphores and fence for the array of fences.

The relevant steps in pseudocode are the following:

vkWaitForFences(..., fence[cf], ...);
vkAcquireNextImageKHR(..., /* signal when done: */ sem1[cf], ...);
vkResetFences(..., fence[cf]);
vkQueueSubmit(graphicsQueue, ...
    /* wait for: */ sem1[cf], /* wait stage: */ COLOR_ATTACHMENT_OUTPUT ...
    vkCmdBeginRenderPass(cb[cf], ...);
      Subpass Dependency between EXTERNAL -> 0:
          srcStages = COLOR_ATTACHMENT_OUTPUT,
          srcAccess = 0, 
          dstStages = COLOR_ATTACHMENT_OUTPUT,
          dstAccess = COLOR_ATTACHMENT_WRITE
      ...
      vkCmdDrawIndexed(cb[cf], ...);
      (Implicit!) Subpass Dependency between 0 -> EXTERNAL:
          srcStages = ALL_COMMANDS,
          srcAccess = COLOR_ATTACHMENT_WRITE|DEPTH_STENCIL_ATTACHMENT_WRITE,
          dstStages = BOTTOM_OF_PIPE,
          dstAccess = 0
    vkCmdEndRenderPass(cb[cf]);
    /* signal when done: */ sem2[cf], ...
    /* signal when done: */ fence[cf]
);
vkQueuePresentKHR(presentQueue, ... /* wait for: */ sem2[cf], ...);

The draw calls are performed on one single queue: the graphicsQueue. We must check if commands on that graphicsQueue could theoretically overlap.

Let us consider the events that are happening on the graphicsQueue in chronological order for the first two frames:

img[0] -> sem1[0] signal -> t|...|ef|fs|lf|co|b -> sem2[0] signal, fence[0] signal
img[1] -> sem1[1] signal -> t|...|ef|fs|lf|co|b -> sem2[1] signal, fence[1] signal

where t|...|ef|fs|lf|co|b stands for the different pipeline stages that a draw call passes through:

t ... TOP_OF_PIPE
ef ... EARLY_FRAGMENT_TESTS
fs ... FRAGMENT_SHADER
lf ... LATE_FRAGMENT_TESTS
co ... COLOR_ATTACHMENT_OUTPUT
b ... BOTTOM_OF_PIPE

While there might be an implicit dependency between sem2[i] signal -> present and sem1[i+1], this only applies when the swap chain provides only one image (or if it would always provide the same image). In the general case, this cannot be assumed. That means there is nothing which would delay the immediate progression of the subsequent frame after the first frame is handed over to present. The fences do not help either: after submitting the frame that signals fence[i], the code next waits on fence[(i+1) % MAX_FRAMES_IN_FLIGHT], not on fence[i], so with more than one frame in flight that wait does not prevent progression of subsequent frames in the general case.

What I mean by all of that: The second frame starts rendering concurrently to the first frame and there is nothing that prevents it from accessing the depth buffer concurrently as far as I can tell.
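To make the fence argument concrete, here is a minimal C++ sketch that is not taken from the tutorial. It assumes the only CPU-side throttle is the wait on fence[cf] at the top of drawFrame and ignores any per-swapchain-image fences; `framesPossiblyInFlight` is a hypothetical helper name. It lists the earlier frames that are not known to be finished at the moment frame n is submitted.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Vulkan or the tutorial).
// Assumption: the only CPU-side throttle is the wait on
// fence[n % maxInFlight] before frame n is recorded and submitted.
std::vector<std::size_t> framesPossiblyInFlight(std::size_t n,
                                                std::size_t maxInFlight) {
    std::vector<std::size_t> result;
    // fence[n % maxInFlight] was last signaled by frame n - maxInFlight,
    // so frames up to and including that one are known to be finished.
    // Every frame after it and before n has been submitted, but nothing
    // guarantees that it has completed.
    const std::size_t first = (n >= maxInFlight) ? n - maxInFlight + 1 : 0;
    for (std::size_t f = first; f < n; ++f) {
        result.push_back(f);
    }
    return result;
}
```

With MAX_FRAMES_IN_FLIGHT = 2, `framesPossiblyInFlight(5, 2)` contains frame 4, which is exactly the frame that would be sharing the single depth buffer with frame 5.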

The Fix:

If we want to use only a single depth buffer, though, we can fix the tutorial's code. What we want to achieve is that the ef and lf stages of a frame wait for the previous frame's draw call to complete before proceeding, i.e. we want to create the following scenario:

img[0] -> sem1[0] signal -> t|...|ef|fs|lf|co|b -> sem2[0] signal, fence[0] signal
img[1] -> sem1[1] signal -> t|...|________|ef|fs|lf|co|b -> sem2[1] signal, fence[1] signal

where _ indicates a wait operation.

In order to achieve this, we would have to add a barrier that prevents subsequent frames from performing the EARLY_FRAGMENT_TESTS and LATE_FRAGMENT_TESTS stages at the same time. There is only one queue where the draw calls are performed, so only the commands in the graphicsQueue require a barrier. The "barrier" can be established by using the subpass dependencies:

vkWaitForFences(..., fence[cf], ...);
vkAcquireNextImageKHR(..., /* signal when done: */ sem1[cf], ...);
vkResetFences(..., fence[cf]);
vkQueueSubmit(graphicsQueue, ...
    /* wait for: */ sem1[cf], /* wait stage: */ EARLY_FRAGMENT_TESTS ...
    vkCmdBeginRenderPass(cb[cf], ...);
      Subpass Dependency between EXTERNAL -> 0:
          srcStages = EARLY_FRAGMENT_TESTS|LATE_FRAGMENT_TESTS,
          srcAccess = DEPTH_STENCIL_ATTACHMENT_WRITE,
          dstStages = EARLY_FRAGMENT_TESTS|LATE_FRAGMENT_TESTS,
          dstAccess = DEPTH_STENCIL_ATTACHMENT_WRITE|DEPTH_STENCIL_ATTACHMENT_READ
      ...
      vkCmdDrawIndexed(cb[cf], ...);
      (Implicit!) Subpass Dependency between 0 -> EXTERNAL:
          srcStages = ALL_COMMANDS,
          srcAccess = COLOR_ATTACHMENT_WRITE|DEPTH_STENCIL_ATTACHMENT_WRITE,
          dstStages = BOTTOM_OF_PIPE,
          dstAccess = 0
    vkCmdEndRenderPass(cb[cf]);
    /* signal when done: */ sem2[cf], ...
    /* signal when done: */ fence[cf]
);
vkQueuePresentKHR(presentQueue, ... /* wait for: */ sem2[cf], ...);

This should establish a proper barrier on the graphicsQueue between the draw calls of the different frames. Because it is an EXTERNAL -> 0-type subpass dependency, we can be sure that renderpass-external commands are synchronized (i.e. sync with the previous frame).

Update: The wait stage for sem1[cf] also has to be changed from COLOR_ATTACHMENT_OUTPUT to EARLY_FRAGMENT_TESTS. This is because layout transitions happen at vkCmdBeginRenderPass time: after the first synchronization scope (srcStages and srcAccess) and before the second synchronization scope (dstStages and dstAccess). Therefore, the swapchain image must already be available at that point so that the layout transition happens at the right time.
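The reasoning about the semaphore wait stage can be sketched with a small model. This is a simplification under stated assumptions, not Vulkan API code: the six stages from the list above are treated as a strictly linear pipeline, the enum values are local stand-ins rather than the real `VkPipelineStageFlagBits` values, and `expandSrc`, `expandDst` and `formsChain` are hypothetical names. The model follows two rules from the specification: a source stage mask also covers all logically earlier stages and a destination stage mask all logically later ones, and two dependencies chain when the second scope of the first one intersects the first scope of the second one.

```cpp
#include <cstdint>

// Local stand-ins for the stages named in the answer, in logical order.
// These are NOT the real VkPipelineStageFlagBits values.
enum Stage : std::uint32_t {
    TOP_OF_PIPE             = 1u << 0,
    EARLY_FRAGMENT_TESTS    = 1u << 1,
    FRAGMENT_SHADER         = 1u << 2,
    LATE_FRAGMENT_TESTS     = 1u << 3,
    COLOR_ATTACHMENT_OUTPUT = 1u << 4,
    BOTTOM_OF_PIPE          = 1u << 5,
};
const std::uint32_t kStageCount = 6;

// First synchronization scope: the given stages plus all logically earlier ones.
std::uint32_t expandSrc(std::uint32_t mask) {
    for (std::uint32_t i = kStageCount; i-- > 0;) {
        if (mask & (1u << i)) return (1u << (i + 1)) - 1u;
    }
    return 0;
}

// Second synchronization scope: the given stages plus all logically later ones.
std::uint32_t expandDst(std::uint32_t mask) {
    const std::uint32_t all = (1u << kStageCount) - 1u;
    for (std::uint32_t i = 0; i < kStageCount; ++i) {
        if (mask & (1u << i)) return all & ~((1u << i) - 1u);
    }
    return 0;
}

// True if the semaphore wait (stages from pWaitDstStageMask) forms an
// execution dependency chain with a dependency using dependencySrcStages.
bool formsChain(std::uint32_t semaphoreWaitStages,
                std::uint32_t dependencySrcStages) {
    return (expandDst(semaphoreWaitStages) & expandSrc(dependencySrcStages)) != 0;
}
```

In this model, `formsChain(COLOR_ATTACHMENT_OUTPUT, EARLY_FRAGMENT_TESTS | LATE_FRAGMENT_TESTS)` is false, while `formsChain(EARLY_FRAGMENT_TESTS, EARLY_FRAGMENT_TESTS | LATE_FRAGMENT_TESTS)` is true, which matches the change of the wait stage described in the update.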

edited Jun 16 '20 at 09:58

answered Jun 15 '20 at 22:55

j00hi


Thank you for the complete answer! I think it might be worth clarifying that the old subpass dependency is not deleted (still needed for semaphore sync), and that there are now 2 from EXTERNAL to 0, just to avoid potential confusion for anyone else reading this. Also, shouldn't dstAccess in the fix also include DEPTH_STENCIL_ATTACHMENT_READ, not just WRITE? – cluntraru Jun 16 '20 at 06:35
Regarding the "old subpass dependency": Do you mean that `COLOR_ATTACHMENT_OUTPUT -> COLOR_ATTACHMENT_OUTPUT` would still be necessary? The batch's `COLOR_ATTACHMENT_OUTPUT` stages must wait for `sem1[cf]` to signal anyways. I do not think that such a dependency must be included in the subpass dependencies, unless I am overlooking something. – j00hi Jun 16 '20 at 07:55
Regarding the two subpass dependencies: Are you referring to the second "`(Implicit!)`" subpass dependency? – j00hi Jun 16 '20 at 07:55
Regarding the `dstAccess`: Yes, you are right. A depth test always performs read and write access. The read access must see the cleared depth buffer values already. Synchronizing with `DEPTH_STENCIL_ATTACHMENT_WRITE` only is not sufficient. I'll update it to `DEPTH_STENCIL_ATTACHMENT_WRITE|DEPTH_STENCIL_ATTACHMENT_READ`. Thanks for catching that! – j00hi Jun 16 '20 at 07:57
By two dependencies, I meant the COLOR_ATTACHMENT_OUTPUT and the one you just added (the implicit 0->EXTERNAL would be a third, not counted in what I said). The old one is still necessary, so the layout transition on the color attachment happens after the semaphore wait that you mentioned. Yes, the color attachment stage waits on the semaphore, but the layout transition happens at a point specified after the src and before the dst of the dependency. On that note, if there are two EXTERNAL->0 dependencies, which one is used to determine when the transition happens? – cluntraru Jun 16 '20 at 08:36
It seems like it would be equivalent to just merge them into a single subpass dependency though. – cluntraru Jun 16 '20 at 08:44
Oh, I see. The layout transition would happen at `vkCmdBeginRenderPass` time, wouldn't it? And `vkCmdBeginRenderPass` would not happen before previous `EARLY_FRAGMENT_TEST|LATE_FRAGMENT_TEST` stages have completed, so to say. I think, the best way would probably be to just change the wait stage of the semaphore to: `/* wait for: */ sem1[cf], /* wait stage: */ EARLY_FRAGMENT_TEST`. – j00hi Jun 16 '20 at 08:47
AFAIK, the layout transition would happen between srcStageMask and dstStageMask and, since we need it to happen after the semaphore signals, anything that creates an execution dependency chain with the semaphore wait should be fine. So since EARLY_FRAGMENT_TEST comes before COLOR_ATTACHMENT (where the image is needed) it should be alright and it only loses us 1-2 stages in synchronization. Also, I just realized that merging the dependencies would not actually be equivalent. – cluntraru Jun 16 '20 at 09:08
Yes, it wouldn't. Because the layout transition would happen before the image has become available (before the semaphore signals). That sounds like undefined behavior once again. I'll update my answer to waiting in `EARLY_FRAGMENT_TEST` since that solution should not have such issues. – j00hi Jun 16 '20 at 09:39
Load op is in `VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT` and store op is in `VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT`. Seems to me `srcStage=LATE_FRAGMENT_TEST` and `dstStage=EARLY_FRAGMENT_TEST` should suffice. Similarly clear load op is a write, so just `dstAccess=WRITE` should suffice. – krOoze Jun 16 '20 at 16:19
Also not sure why `pWaitDstStage` should be changed. That does not seem to solve anything. The semaphore only relates to the swapchain image, and the layout transition happens between `srcStage` and `dstStage` of the dependency; there should be no problem assuming there is no layout change to `finalLayout`. Though the original `pWaitDstStage` value should still be in the dependency chain, so `srcStage=COLOR | LATE`. – krOoze Jun 16 '20 at 16:28
Hmm, you mean that before any depth-read access is performed, the whole depth buffer is cleared and hence,`dstAccess=WRITE` should suffice because *that* `dstAccess` in particular refers to the **clear**-operation and not to depth read access. Is that correct? If a `LOAD_OP_LOAD` operation would be used for the depth attachment, then we would need `dstAccess=READ|WRITE`, right? – j00hi Jun 16 '20 at 18:00
Regarding `pWaitDstStage`: If this would not be set to `EARLY_FRAGMENT_TEST`, then there would be two `srcStage`/`dstStage` pairs: The one of the subpass dependency and the one of the semaphore. Between which of these two does the image layout transition of the **color attachment** happen? Between the src/dst of the subpass dependency or between the src/dst of the semaphore? I think that the layout transition happens at `vkCmdBeginRenderPass` time, doesn't it? If it would happen between the semaphore's src/dst-pair, then it would NOT happen at `vkCmdBeginRenderPass` time. – j00hi Jun 16 '20 at 18:08
Regarding "Load op is in `VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT` and store op is in `VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT`": I don't think that this is the case in general, because Table 4 under [6.1.3. Access Types](https://www.khronos.org/registry/vulkan/specs/1.1/html/vkspec.html#synchronization-access-types) lists both, `VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT` and `VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT` as valid access flags for both stages, `EARLY_FRAGMENT_TESTS` and `LATE_FRAGMENT_TESTS`. – j00hi Jun 16 '20 at 18:14
Regarding `pWaitDstStage` once more: After comparing with the [Swapchain Image Acquire and Present](https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples#swapchain-image-acquire-and-present) example from Khronos' Synchronization Examples, I am a lot more certain that `EARLY_FRAGMENT_TEST` is the right stage for both, the subpass dependency and the semaphore's wait stage. The example says about the subpass dependency stages: "`// .srcStageMask needs to be a part of pWaitDstStageMask in the WSI semaphore.`". => we'll have to use `EARLY_FRAGMENT_TEST` for "**The Fix**". – j00hi Jun 16 '20 at 19:16
`LOAD_OP_LOAD` would be just `READ`. `vkCmdBeginRenderPass`-time is not a thing; at least I have no idea what you mean by it. It is just a state setting command. External dependencies are what matters in determining when layout transitions from `initial` and to `final` happen. (BTW one solution would be to never change the depth buffer layout, too.) For this purpose multiple dependencies all apply as one; just search the spec for "Automatic layout transition". When load op and store op happen and with which access flags is also explicitly stated in the spec, so it is general. – krOoze Jun 16 '20 at 19:24
Because we are using `LOAD_OP_CLEAR`, we can safely set `dstAccess = DEPTH_STENCIL_ATTACHMENT_WRITE` in the subpass dependency --- is that what you mean/would you agree? Because the spec says: *"The load operation for each sample in an attachment happens-before any recorded command which accesses the sample in the first subpass where the attachment is used."* But can we be sure that `DEPTH_STENCIL_ATTACHMENT_READ` access is also visible? Because we'll definitely need read access to the depth attachment. – j00hi Jun 18 '20 at 19:25
With "`vkCmdBeginRenderpass`-time" I actually meant the `EXTERNAL -> 0` dependency. (I should not introduce new terminology, agreed.) The point I wanted to make... or the hypothesis I am making: The image layout transition of the color attachment happens between `srcStage` and `dstStage` of the `EXTERNAL -> 0` subpass dependency. Is that assumption correct? – j00hi Jun 18 '20 at 19:29
@j00hi Wouldn't it be an easier solution to just have a depth buffer image per swapchain image? – Daniel Marques Sep 26 '21 at 16:18
@DanielMarques of course that would be an option---maybe even a good one since it theoretically allows to parallelize more. It's just that the question was about using a single depth buffer, which could be advantageous in extremely memory-limited settings, for example. – j00hi Sep 28 '21 at 07:16

krOoze · Answer 2 · 2020-06-15T22:45:18.700

No, rasterization order does not (per specification) extend outside a single subpass. If multiple subpasses write to the same depth buffer, then there should be a VkSubpassDependency between them. If something outside a render pass writes to the depth buffer, then there should also be explicit synchronization (via barriers, semaphores, or fences).

FWIW I think the vulkan-tutorial sample is non-conformant. At least I do not see anything that would prevent a memory hazard on the depth buffer. It seems that the depth buffer should be duplicated to MAX_FRAMES_IN_FLIGHT, or explicitly synchronized.
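As a rough illustration of the duplication option, the following C++ sketch checks whether frame n would share a depth image with a frame that may still be executing. It assumes that only the per-frame fences throttle the CPU and that depth images are selected by frame number modulo the image count; both function names are hypothetical and nothing here is taken from the tutorial.

```cpp
#include <cstddef>

// Hypothetical: which depth image a given frame renders into.
std::size_t depthImageIndex(std::size_t frameNumber,
                            std::size_t depthImageCount) {
    return frameNumber % depthImageCount;
}

// Hypothetical: true if frame n uses the same depth image as any earlier
// frame that is not known to be finished when frame n is submitted.
// Assumption: waiting on fence[n % maxInFlight] only proves that frames
// up to n - maxInFlight have completed.
bool sharesDepthWithInFlightFrame(std::size_t n,
                                  std::size_t maxInFlight,
                                  std::size_t depthImageCount) {
    const std::size_t first = (n >= maxInFlight) ? n - maxInFlight + 1 : 0;
    for (std::size_t f = first; f < n; ++f) {
        if (depthImageIndex(f, depthImageCount) ==
            depthImageIndex(n, depthImageCount)) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, one depth image with two frames in flight gives `sharesDepthWithInFlightFrame(5, 2, 1) == true`, whereas one depth image per frame in flight gives `sharesDepthWithInFlightFrame(5, 2, 2) == false`.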

The sneaky part about undefined behavior is that wrong code often works correctly. Unfortunately, proving synchronization correctness in the validation layers is a little bit tricky, so for now the only thing that remains is to simply be careful.

Futureproofing the answer:
What I do see is a conventional WSI semaphore chain (used with vkAcquireNextImageKHR and vkQueuePresentKHR) with imageAvailable and renderFinished semaphores. There is only one subpass dependency, with VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, that is chained to the imageAvailable semaphore. Then there are fences with MAX_FRAMES_IN_FLIGHT == 2, and fences guarding the individual swapchain images. This means two subsequent frames should run unimpeded wrt each other (except in the rare case they acquire the same swapchain image). So, the depth buffer seems to be unprotected between two frames.

This is what I suspected as well, leading to my posting the question. Thank you for putting in the time to look through the code, I know it was a lot to ask. — cluntraru, Jun 15 '20 at 19:55

score 1 · Answer 3 · answered May 06 '22 at 23:32

Yes, I also spent some time trying to figure out what was meant by the statement "We only need a single depth image, because only one draw operation is running at once."

That didn't make sense to me for a triple buffered rendering setup where work is submitted to the queues until MAX_FRAMES_IN_FLIGHT is reached - there's no guarantee that all three aren't running at once!

Whilst the single depth image worked OK, triplicating everything so each frame uses a fully independent set of resources (blocks and all) would seem to be the safest design and yielded identical performance under test.
