
In my application it is imperative that "state" and "graphics" are processed in separate threads. So for example, the "state" thread is only concerned with updating object positions, and the "graphics" thread is only concerned with graphically outputting the current state.

For simplicity, let's say that the entirety of the state data is contained within a single VkBuffer. The "state" thread creates a compute pipeline with a storage buffer backed by the VkBuffer, and periodically submits vkCmdDispatch calls to update the VkBuffer.

Concurrently, the "graphics" thread creates a graphics pipeline with a uniform buffer backed by the same VkBuffer, and periodically draws and presents via vkQueuePresentKHR.

Obviously there must be some sort of synchronization mechanism to prevent the "graphics" thread from reading from the VkBuffer whilst the "state" thread is writing to it.

The only idea I have is to hold a host mutex from vkQueueSubmit to vkWaitForFences in both threads.

Is there some other method that is more efficient, or is this approach considered acceptable?

Cinolt Yuklair

2 Answers


Try using semaphores. They synchronize operations solely on the GPU, which is far more efficient than waiting in the application and only submitting work after the previous work has fully completed.

When you submit work, you can provide a semaphore that gets signaled when that work is finished. When you submit another batch, you can provide the same semaphore for the second batch to wait on. Processing of the second batch then starts automatically once the semaphore is signaled (the wait also automatically unsignals the semaphore, so it can be reused).
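As a concrete illustration, two submissions chained by one semaphore might look like the following sketch. All handles here (computeQueue, graphicsQueue, computeCmdBuf, graphicsCmdBuf, and the semaphore itself) are hypothetical and assumed to have been created elsewhere:

```cpp
// Assumed to exist: computeQueue, graphicsQueue, computeCmdBuf, graphicsCmdBuf,
// and computeDone created once with vkCreateSemaphore.
VkSemaphore computeDone;

// Compute batch: signals computeDone when the dispatch finishes.
VkSubmitInfo computeSubmit = {};
computeSubmit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
computeSubmit.commandBufferCount   = 1;
computeSubmit.pCommandBuffers      = &computeCmdBuf;
computeSubmit.signalSemaphoreCount = 1;
computeSubmit.pSignalSemaphores    = &computeDone;
vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

// Graphics batch: waits on computeDone before its vertex shader stage runs.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
VkSubmitInfo graphicsSubmit = {};
graphicsSubmit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
graphicsSubmit.waitSemaphoreCount = 1;
graphicsSubmit.pWaitSemaphores    = &computeDone;
graphicsSubmit.pWaitDstStageMask  = &waitStage;
graphicsSubmit.commandBufferCount = 1;
graphicsSubmit.pCommandBuffers    = &graphicsCmdBuf;
vkQueueSubmit(graphicsQueue, 1, &graphicsSubmit, VK_NULL_HANDLE);
```

Neither vkQueueSubmit call blocks the CPU; the ordering between the two batches is enforced entirely on the GPU.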

[EDIT] There are constraints on using semaphores, but they shouldn't affect your use case: when you use a semaphore as a wait semaphore during a submission, no other queue can wait on that same semaphore.

There are also events in Vulkan which can be used for similar purposes but their use is a little bit more complicated.

If you really need to synchronize the GPU with your application, use fences. They are signaled in a similar way to semaphores, but you can check their state on the application side, and you need to manually unsignal (reset) them before you can use them again.

[EDIT]

I've added an image that shows roughly what I think you should do. One thread calculates state and, with each submission, adds a semaphore to the top of a list (or a ring buffer, as @NicolBolas wrote). This semaphore gets signaled when the submission is finished (it is provided in pSignalSemaphores when the "compute" batch is submitted).

The second thread renders your scene. It manages its own list of semaphores, similar to the compute thread. But when you want to render, you need to be sure the compute thread has finished its calculations. That's why you take the latest "compute" semaphore and wait on it (provide it in pWaitSemaphores when submitting the "render" batch). Conversely, once rendering commands are submitted, the compute thread must not start modifying the data, because that could influence the results of the rendering. So the compute thread also needs to provide a wait semaphore: the most recent "rendering" semaphore.

You just need to synchronize the submissions themselves: the rendering thread cannot submit while the compute thread is submitting commands, and vice versa. That's why adding semaphores to the lists (and taking semaphores from them) must be synchronized. But this has nothing to do with Vulkan; an ordinary mutex will do (for example, C++'s std::lock_guard<std::mutex>). This synchronization is only a problem when you have a single buffer.
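A minimal sketch of such a mutex-protected list, using an integer stand-in for VkSemaphore so only the threading logic is shown:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Stand-in for VkSemaphore; in real code this would be the Vulkan handle.
using FakeSemaphore = std::uint64_t;

class SemaphoreList {
public:
    // Compute thread: record the semaphore it just passed in pSignalSemaphores.
    void push(FakeSemaphore s) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(s);
    }

    // Render thread: fetch the most recent "compute finished" semaphore to
    // pass in pWaitSemaphores; returns false if nothing has been submitted yet.
    bool latest(FakeSemaphore& out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = items_.back();
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::vector<FakeSemaphore> items_;
};
```

The same structure works symmetrically for the "rendering" semaphores that the compute thread waits on.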

[Image: synchronization of compute and rendering threads]

Another question is what to do with old semaphores from both lists. You cannot directly check their state and you cannot directly unsignal them. Their state can be inferred through additional fences provided with each submission: you don't wait on them, but from time to time you check whether a given fence is signaled. If it is, you can either destroy the old semaphore (as you cannot unsignal it from the application) or make an empty submission, with no command buffers, that uses the semaphore as a wait semaphore. That unsignals it so you can reuse it. I don't know which solution is more efficient: destroying old semaphores and creating new ones, or unsignaling them with empty submissions.
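The recycling scheme can be sketched like this, with plain stand-ins for VkSemaphore and VkFence, and the fence status modeled as a boolean (in real code you would query it with vkGetFenceStatus):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each submission pairs the semaphore it signaled with a fence; a semaphore
// may only be reused or destroyed once its fence has signaled.
struct InFlight {
    std::uint64_t semaphore;     // stand-in for VkSemaphore
    bool fenceSignaled;          // real code: vkGetFenceStatus(...) == VK_SUCCESS
};

// Moves semaphores whose fences have signaled into a reuse pool and returns
// how many were reclaimed; call this from time to time on either thread.
std::size_t reclaim(std::vector<InFlight>& inFlight,
                    std::vector<std::uint64_t>& reusable) {
    std::size_t reclaimed = 0;
    for (auto it = inFlight.begin(); it != inFlight.end();) {
        if (it->fenceSignaled) {
            reusable.push_back(it->semaphore);  // or destroy it instead
            it = inFlight.erase(it);
            ++reclaimed;
        } else {
            ++it;
        }
    }
    return reclaimed;
}
```

Whether a reclaimed semaphore goes back into a pool (after an unsignaling empty submission) or is simply destroyed and recreated is the open trade-off mentioned above.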

When you have a single buffer, a one-element list/ring is probably enough. A better solution would use a ping-pong pair of buffers: you read data from one buffer but store results in another, and in the next step you swap them. That's why, in the image above, the lists of semaphores (rings) may have more elements depending on your setup. The more independent buffers and semaphores in the lists (up to some reasonable count), the better the performance, as you reduce time wasted on waiting. But this complicates your code, and it may also increase lag (the rendering thread gets data that is a bit older than the data currently being processed by the compute thread). So you may need to balance performance, code complexity, and rendering lag.
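The ping-pong idea reduces to tracking which buffer is currently readable and which is writable; the indices below stand in for two VkBuffers created up front:

```cpp
#include <cstddef>
#include <utility>

// Ping-pong buffer bookkeeping: the renderer reads the "read" buffer while the
// compute pass fills the "write" buffer; after each state update the roles swap.
struct PingPong {
    std::size_t read  = 0;  // index of the buffer the renderer samples this frame
    std::size_t write = 1;  // index of the buffer the compute pass fills this frame

    void swap() { std::swap(read, write); }
};
```

With this layout, the next compute submission can begin as soon as it has its own buffer to write into, instead of waiting for rendering from the other buffer to finish.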

Ekzuzy
  • I know what semaphores, events, and fences are. However, I do not see how to solve my problem without using host mutexes alongside them. Is there some more efficient method? If so, could you provide at least high-level pseudocode to demonstrate it? – Cinolt Yuklair Jan 03 '18 at 22:04
  • If I understand correctly... you want to asynchronously calculate state using compute shaders and, independently and also asynchronously, render the scene using that state data. So you need to associate a semaphore with each state calculation (provide a semaphore for each submission that calculates the state). Then, when you want to render your scene, you take the most recent semaphore and provide it as a wait semaphore when submitting the rendering commands. You may also need another semaphore that is signaled when your rendering is done... – Ekzuzy Jan 03 '18 at 23:32
  • This rendering semaphore may be needed because you probably don't want to perform further state calculations until your rendering is finished (so you don't change the data during rendering). This set of semaphores lets you synchronize everything on the GPU side, so your application doesn't need to wait. But as not all semaphores may be used to synchronize compute and graphics batches, you will also need some way to unsignal some of them so they can be reused. That is another topic, which may require fences but can also be handled on a separate thread. – Ekzuzy Jan 03 '18 at 23:37
  • To further minimize waiting even on the GPU side, your compute pipelines may read data from one buffer but store results in another. This way calculations and rendering can partially overlap: when you start rendering, further state calculations can begin, as they will not alter the data used for rendering. This also requires semaphores on both sides (compute and graphics), for the same purpose as above. It just speeds things up a bit at the cost of complexity. If you still need pseudocode, I can prepare some tomorrow. – Ekzuzy Jan 03 '18 at 23:46
  • I would appreciate some pseudocode, because I don't understand which semaphores to wait on. For example, if one thread waits on semaphore A, should the other thread wait on semaphore A too? That doesn't seem to synchronize them. So should one thread wait on semaphore A and the other wait on semaphore B? The two threads may be running at very different rates (let's say 60 FPS for graphics, 240 FPS for state), so I don't see how two semaphores will achieve what I want. – Cinolt Yuklair Jan 04 '18 at 01:52

How you do this depends on two factors:

  1. Whether you want to dispatch the compute operation on the same queue as its corresponding graphics operation.

  2. The ratio of compute operations to their corresponding graphics operations.

#2 is the most important part.

Even though they are generated in separate threads, there must be at least some idea that the graphics operation is being fed by a particular compute operation (otherwise, how would the graphics thread know where the data is to read from?). So, how do you do that?

At the end of the day, that part has nothing to do with Vulkan. You need to use some inter-thread communication mechanism to allow the graphics thread to ask, "which compute task's data should I be using?"

Typically, this would be done by having the compute thread add every compute operation it does to some kind of circular buffer (thread-safe, of course, and ideally non-locking). When the graphics thread decides where to read its data from, it asks the circular buffer for the most recently added compute operation.
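One way to sketch such a non-locking hand-off: the compute thread fills a slot in a fixed ring and then publishes the slot's index with release ordering, and the graphics thread reads whichever index was published last. The ComputeOp record and its fields are stand-ins for whatever per-operation data (buffer location, sync primitive) the real application tracks:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSlots = 4;

// Hypothetical per-operation record; real code would hold a VkSemaphore and
// the location of the written data.
struct ComputeOp {
    std::uint32_t bufferIndex;  // where the compute pass wrote its results
    std::uint64_t semaphore;    // stand-in for the sync primitive to wait on
};

struct LatestOp {
    ComputeOp slots[kSlots];
    std::atomic<int> latest{-1};  // -1 means nothing has been published yet

    // Producer (compute thread): fill a slot, then publish its index.
    // Release ordering makes the slot's contents visible to the consumer.
    void publish(std::size_t slot, const ComputeOp& op) {
        slots[slot] = op;
        latest.store(static_cast<int>(slot), std::memory_order_release);
    }

    // Consumer (graphics thread): get the most recently published operation.
    bool mostRecent(ComputeOp& out) const {
        int i = latest.load(std::memory_order_acquire);
        if (i < 0) return false;
        out = slots[i];
        return true;
    }
};
```

Note that a real implementation must also ensure the producer never refills a slot the consumer might still be reading, e.g. by keeping the ring deep enough and recycling slots only after their fences signal.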

In addition to the "where to read its data from" information, this would also provide the graphics thread with an appropriate Vulkan synchronization primitive to use to synchronize its command buffer(s) with the compute operation's CB.

If the compute and graphics operations are being dispatched on the same queue, then this is pretty simple. There doesn't have to actually be a synchronization primitive. So long as the graphics CBs are issued after the compute CBs in the batch, all the graphics CBs need is to have a vkCmdPipelineBarrier at the front which waits on all memory operations from the compute stage.

srcStageMask would be VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, with dstStageMask being, well, pretty much everything (you could narrow it down, but it won't matter much, since at the very least your vertex shader stage will need to be there).

You would need a single VkMemoryBarrier in the pipeline barrier. Its srcAccessMask would be VK_ACCESS_SHADER_WRITE_BIT, while the dstAccessMask would be however you intend to read it. If the compute operations wrote vertex data, you need VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT. If they wrote uniform buffer data, you need VK_ACCESS_UNIFORM_READ_BIT. And so on.
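Put together, the barrier might be recorded like this. cmdBuf is an assumed command buffer in the recording state, and the destination access here matches the question's uniform-buffer read:

```cpp
// Recorded at the front of the graphics work, after the compute commands
// in the same queue.
VkMemoryBarrier barrier = {};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;   // the compute writes
barrier.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT;   // how the graphics side reads

vkCmdPipelineBarrier(
    cmdBuf,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,       // srcStageMask: where the writes happen
    VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |       // dstStageMask: wherever the data is read
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    0,                                          // dependencyFlags
    1, &barrier,                                // global memory barriers
    0, nullptr,                                 // buffer memory barriers
    0, nullptr);                                // image memory barriers
```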

If you're dispatching these operations on separate queues, that's where you need an actual synchronization object.

There are several problems:

  1. You cannot detect from user code whether a Vulkan semaphore has been signaled, nor can you set a semaphore to the unsignaled state from user code. Nor can you reasonably submit a batch that signals a semaphore which is currently signaled with nobody waiting on it; you can technically do that, but it won't do the right thing.

    In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it.

  2. You cannot issue a batch that waits on a semaphore, unless a batch that signals it is "pending execution". That is, your graphics thread cannot vkQueueSubmit its batch until it is certain that the compute queue has submitted its signaling batch.

So what you have to do is this: when the graphics thread goes to get its compute data, it must signal the compute thread to add a semaphore to its next submit call. When the graphics thread submits its graphics operation, it then waits on that semaphore.

But to ensure proper ordering, the graphics thread cannot submit its operation until the compute thread has submitted the semaphore signaling operation. That requires a CPU-synchronization operation of some form. It could be as simple as the graphics thread polling an atomic variable set by the compute thread.
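A sketch of that CPU-side handshake with a single atomic flag; the actual vkQueueSubmit calls are elided as comments:

```cpp
#include <atomic>
#include <thread>

// The graphics thread must not submit its waiting batch until the compute
// thread has submitted the batch that signals the shared semaphore.
std::atomic<bool> signalSubmitted{false};

void computeThreadSubmit() {
    // ... vkQueueSubmit(computeQueue, ...) with the semaphore in pSignalSemaphores ...
    signalSubmitted.store(true, std::memory_order_release);
}

void graphicsThreadSubmit() {
    // Poll (yielding) until the signaling batch is pending execution.
    while (!signalSubmitted.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // ... vkQueueSubmit(graphicsQueue, ...) with the semaphore in pWaitSemaphores ...
}
```

In a real application the flag would be per-frame (or a counter), reset or advanced each time a new semaphore pairing is arranged.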

Nicol Bolas
  • I don't agree with bullet #1: "In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it." I know this may be suboptimal or may suggest a bad design, but from the spec's perspective it is still valid to submit a batch of commands that signals a semaphore which no other submission waits on. Another thing: it may be suboptimal or make little sense, but we can check whether a semaphore is already signaled by using fences (we know the batch has already been processed). Maybe events may also come in handy in this situation. – Ekzuzy Jan 04 '18 at 07:12
  • @Ekzuzy Right. What you cannot do is submit a wait, that has no signal operation submitted previously. – krOoze Jan 04 '18 at 13:48
  • @krOoze Yes. And for unsignaling semaphores we can just submit empty batches. This may have some performance implications but I don't know if they are noticeable and if it is better to just destroy old semaphores and create new ones. – Ekzuzy Jan 04 '18 at 13:51