How to measure execution time of Vulkan pipeline

Question

Summary

I wish to be able to measure time elapsed in milliseconds, on the GPU, of running the entire graphics pipeline. The goal: To be able to save benchmarks before/after optimizing the code (next step would be mipmapping textures) to see improvements. This was really simple in OpenGL, but I'm new to Vulkan, and could use some help.

I have browsed related existing answers (here and here), but they aren't really of much help. And I cannot find code samples anywhere, so I dare ask here.

Through documentation pages I have found a couple of functions that I think I should be using, so I have in place something like this:

1: Creating query pool

void CreateQueryPool()
{
    VkQueryPoolCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
    createInfo.pNext = nullptr; // Optional
    createInfo.flags = 0; // Reserved for future use, must be 0!

    createInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
    createInfo.queryCount = mCommandBuffers.size() * 2; // REVIEW

    VkResult result = vkCreateQueryPool(mDevice, &createInfo, nullptr, &mTimeQueryPool);
    if (result != VK_SUCCESS)
    {
        throw std::runtime_error("Failed to create time query pool!");
    }
}

I had the idea of queryCount = mCommandBuffers.size() * 2 to have space for a separate query timestamp before and after rendering, but I have no clue whether this assumption is correct or not.

2: Recording command buffers

// recording command buffer i:
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i);
// render pass ...
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i);

vkCmdCopyQueryPoolResults(/* many parameters here */);

I'm looking for a couple of clarifications:

What is the concequence of writing to the same query index? Do I need two separate query pools - one for before render time and one for after render time?
How should I handle synchronization? I assume having a separate query for each command buffer.
For the destination buffer containing the query result, is it good enough to store somewhere with "host visible bit", or do I need staging memory for "device visible only"? I'm a bit lost on this one as well.

I have not been able to find any online examples of how to measure render time, but I just assume it's such a common task that surely there must be an example out there somewhere.

"*next step would be mipmapping textures*" Unless you generated the textures on the GPU, or the texture is decidedly non-image-like, you should *always* mipmap your textures. — Nicol Bolas, May 02 '21 at 21:43
@NicolBolas Just following a tutorial. Decided to implement timing before moving to the mipmapping chapter, because I was curious to see how much it actually improves. — alexpanter, May 03 '21 at 07:24

alexpanter · Accepted Answer · 2021-05-04T10:32:34.620

So, thanks to @karlschultz, I managed to get something working. So in case other people will be looking for the same answer, I decided to post my findings here. For the Vulkan experts out there: Please let me know if I make obvious mistakes, and I will correct them here!

Query Pool Creation

I fill out a VkQueryPoolCreateInfo struct as described in my question, and let its queryCount field equal twice the number of command buffers, to store space for a query before and after rendering.

Important here is to reset all entries in the query pool before using the queries, and to reset a query after writing to it. This necessitates a few changes:

1) Asking graphics queue if timestamps are supported

When picking the graphics queue family, the struct VkQueueFamilyProperties has a field timestampValidBits which must be greater than 0, otherwise the queue family cannot be used for timestamp queries!

2) Determining the timestamp period

The physical device contains a special value which indicates the number of nanoseconds it takes for a timestamp query to be incremented by 1. This is necessary to interpret the query result as e.g. nanoseconds or milliseconds. That value is a float, and can be retrieved by calling vkGetPhysicalDeviceProperties and looking at the field VkPhysicalDeviceProperties.limits.timestampPeriod.

3) Asking for query reset support

During logical device creation, one must fill out a struct and add it to the pNext chain to enable the host query reset feature:

VkDeviceCreateInfo createInfo{};
VkPhysicalDeviceHostQueryResetFeatures resetFeatures;
resetFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_HOST_QUERY_RESET_FEATURES;
resetFeatures.pNext = nullptr;
resetFeatures.hostQueryReset = VK_TRUE;

createInfo.pNext = &resetFeatures;

4) Recording command buffers

Timestamp queries should be outside the scope of the render pass, as seen below. It is not possible to measure running time of a single shader (e.g. fragment shader), only the entire pipeline or whatever is outside the scope of the render pass, due to (potential) temporal overlap of pipeline stages.

vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i * 2);

vkCmdBeginRenderPass(/* ... */);

// render here...

vkCmdEndRenderPass(mCommandBuffers[i]);

vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i * 2 + 1);

5) Retrieving query result

We have two methods for this: vkCmdCopyQueryPoolResults and vkGetQueryPoolResults. I chose to go with the latter since is greatly simplifies the setup and does not require synchronization with GPU buffers.

Given that I have a swapchain index (in my scenario same is command buffer index!), I have a setup like this:

void FetchRenderTimeResults(uint32_t swapchainIndex)
{
    uint64_t buffer[2];

    VkResult result = vkGetQueryPoolResults(mDevice, mTimeQueryPool, swapchainIndex * 2, 2, sizeof(uint64_t) * 2, buffer, sizeof(uint64_t),
    VK_QUERY_RESULT_64_BIT);
    if (result == VK_NOT_READY)
    {
        return;
    }
    else if (result == VK_SUCCESS)
    {
        mTimeQueryResults[swapchainIndex] = buffer[1] - buffer[0];
    }
    else
    {
        throw std::runtime_error("Failed to receive query results!");
    }

    // Queries must be reset after each individual use.
    vkResetQueryPool(mDevice, mTimeQueryPool, swapchainIndex * 2, 2);
}

The variable mTimeQueryResults refers to an std::vector<uint64_t> which contains a result for each swapchain. I use it to calculate an average rendering time each second by using the timestamp period determined in step 2).

And one must not forget to cleanup to query pool by calling vkDestroyQueryPool.

There are a lot of details omitted, and for a total Vulkan noob like me this setup was frightening and took several days to figure out. Hopefully this will spare someone else the headache.

More info in documentation.

This is a nice write-up. As an alternative to (3), you could instead reset the query pool with a `vkCmdResetQueryPool` command in your command buffer, placed before the commands to write the timestamps. There's an implicit execution dependency for these query-related commands, so the reset will finish before the timestamps are written. The choice between the two approaches probably depends on other aspects of the application, but with the device-based reset, you don't need to enable the device feature and take into account host sync with `vkResetQueryPool`. — Karl Schultz, May 04 '21 at 15:39
That's actually a better solution. I just though `vkCmdResetQueryPool` would necessitate the command buffer copy command. But I can only imagine that host query reset is slower as well. Again thanks for your inputs. — alexpanter, May 04 '21 at 17:53

score 1 · Answer 2 · answered May 03 '21 at 15:08

Writing to the same query index is bad because you are overwriting your "before" timestamp with the "after" timestamp at the same query index. You might want to change the last parameter in your write timestamp calls to i * 2 for the "before" call and to i * 2 + 1 for the "after". You are already allocating 2 timestamps for each command buffer, but only using half of them. This scheme ends up producing a pair of before/after timestamps for each command buffer i.

I don't have any experience using vkCmdCopyQueryPoolResults(). If you can idle your queue, then after idle, call vkGetQueryPoolResults() which will probably be much easier for what you are doing here. It copies the query results back into host memory and you don't have to mess with synchronizing writes to another buffer and then mapping/reading it back.

Thank you so much! It has been literally impossible for me to find code samples anywhere. I will try this. +1 — alexpanter, May 03 '21 at 15:20