This is simplified pseudocode of how I'm trying to measure a GPU workload:
```
for (i = 0; i < N; ++i) vkCmdDrawIndexed(cmd, ...);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);
submit();
vkDeviceWaitIdle(device);
vkGetQueryPoolResults(...);
```
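For completeness, the un-simplified flow expands to roughly the following sketch (pseudocode; `cmd`, `queryPool`, and the other handle names are my placeholders, error handling omitted):

```
// once: create a pool with two timestamp queries
VkQueryPoolCreateInfo info = { .sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO,
                               .queryType = VK_QUERY_TYPE_TIMESTAMP,
                               .queryCount = 2 };
vkCreateQueryPool(device, &info, NULL, &queryPool);

// per frame, recorded into the command buffer
// (queries must be reset before each reuse, outside a render pass):
vkCmdResetQueryPool(cmd, queryPool, 0, 2);
for (i = 0; i < N; ++i) vkCmdDrawIndexed(cmd, ...);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

// after submit + vkDeviceWaitIdle:
uint64_t ts[2];
vkGetQueryPoolResults(device, queryPool, 0, 2, sizeof(ts), ts,
                      sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
```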
Things to note:
- N is 224 in my case
- I have to wait for an idle device; without it, I keep getting a validation error saying that the data is not ready, even though I have multiple query pools in flight
- with the first timestamp placed there, I expect the query value to be written as soon as all previous commands have reached the preprocessing step. I was fairly sure that all 224 commands are preprocessed at almost the same time, but reality shows this is not true
- with the second timestamp, I expect the query value to be written after all previous commands have finished, i.e. the difference between the two query values should give me the time it takes the GPU to do all the work for a single frame
- I'm taking VkPhysicalDeviceLimits::timestampPeriod (1 on my machine) and VkQueueFamilyProperties::timestampValidBits (64 on my machine) into account
I created a big dataset that visibly takes approx. 2 seconds (~2000 ms) to render a single frame. But the calculated time only ever takes two different values, either 0.001024 ms or 0.002048 ms, so the frame-by-frame output can look like this:
0.001024ms
0.001024ms
0.002048ms
0.001024ms
0.002048ms
0.002048ms
...
I don't know about you, but I find these values VERY suspicious, and I have no explanation for them. Maybe by the time the last draw command reaches the command processor all the work is already done, but why exactly 1024 and 2048?
I tried to modify the code and move the first timestamp up, i.e.:
```
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
for (i = 0; i < N; ++i) vkCmdDrawIndexed(cmd, ...);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);
```
Now, by the time the preprocessor hits the timestamp command, it writes the query value immediately, because there is no previous work and nothing to wait for (remember, the device was idle). This time I get different values, closer to the truth:
20.9336ms
20.9736ms
21.036ms
21.0196ms
20.9572ms
21.3586ms
...
which is better, but still far from the expected ~2000 ms.
What's going on? What happens inside the device when I write timestamps, and how do I get correct values?