
Here is simplified pseudocode of how I'm trying to measure a GPU workload:

for(N) vkCmdDrawIndexed();

vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
vkCmdWriteTimestamp(VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

submit();
vkDeviceWaitIdle();
vkGetQueryPoolResults();

Things to note:

  • N is 224 in my case
  • I have to wait for an idle device; without it, I keep getting a validation error saying the data is not ready, even though I have multiple query pools in flight
  • with the first timestamp, I expect the query value to be written as soon as all previous commands have reached the preprocessing step. I was fairly sure that all 224 commands are preprocessed at almost the same time, but reality shows this is not true.
  • with the second timestamp, I expect the query value to be written after all previous commands have finished, i.e. the difference between the two query values should give me the time it takes the GPU to do all the work for a single frame.
  • I'm taking into account VkPhysicalDeviceLimits::timestampPeriod (1 on my machine) and VkQueueFamilyProperties::timestampValidBits (64 on my machine)

I created a big dataset that visually takes approx 2 seconds (~2000ms) to render a single frame. But the calculated time takes only two distinct values, either 0.001024ms or 0.002048ms, so the frame-by-frame output can look like this:

0.001024ms
0.001024ms
0.002048ms
0.001024ms
0.002048ms
0.002048ms
...

I don't know about you, but I find these values VERY suspicious. I have no explanation for them. Maybe by the time the last draw command reaches the command processor all the work is already done, but why the hell 1024 and 2048??

I tried modifying the code and moving the first timestamp above the draw calls:

vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);

for(N) vkCmdDrawIndexed();

vkCmdWriteTimestamp(VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

Now, when the command processor hits the timestamp command, it writes the query value immediately, because there is no previous work to wait for (remember the device was idle). This time I get different values, closer to the truth:

20.9336ms
20.9736ms
21.036ms
21.0196ms
20.9572ms
21.3586ms
...

which is better, but still far from the expected ~2000ms.

What's going on, what happens inside the device when I write timestamps, and how do I get correct values?

nikitablack
  • Where is the workload that you're trying to time? I don't understand what you think is happening with those two timestamp calls. Do you expect them to return the execution time of the set of draw calls? And which parts do you want to time the execution of: the vertex shaders? Fragment shaders? Rasterization? The whole pipeline? – Nicol Bolas Jul 27 '20 at 18:33
  • @Nicol Bolas The work is happening in `vkCmdDrawIndexed` and I try to measure how long it takes by writing two timestamps - before and after the work. So as you wrote - I expect them to return the execution time of the set of draw calls. In the queries, I specified the top of the pipe and the bottom of the pipe. With that, I expect the first value to be written before the work starts and the second value to be written after the whole pipeline has finished for all draw calls. – nikitablack Jul 28 '20 at 06:29

1 Answer


While commands in Vulkan can be executed out of order (within certain restrictions), you should not broadly expect commands to be executed out of order. This is especially true of timer queries which, if they were executed out of order, would be unreliable in terms of their meaning.

Given that, your code is saying, "do a bunch of work. Then query the time it takes for the start of the pipe to be ready to execute new commands, then query the time it takes for the end of the pipe to be reached." Well, the start of the pipe might only be ready to execute new commands once most of the work is done.

Basically, what you think is happening is this:

time->
top        work work work work work work | timer
stage1                                    work work work work work work 
stage2                                        work work work work work work 
bottom                                            work work work work work work | timer

But there's nothing that requires GPUs to execute this way. What is almost certainly actually happening is:

time->
top        work work work work work work | timer
stage1         work work work work work work 
stage2             work work work work work work 
bottom                 work work work work work work | timer

So your two timers are only getting a fraction of the actual work.

What you want is this:

top        timer | work work work work work work
stage1                 work work work work work work 
stage2                     work work work work work work 
bottom                         work work work work work work | timer

This queries the time from start to finish for the entire set of work.

So put the first query before the work whose time you want to measure.

Nicol Bolas
  • Thank you. I came to the same conclusion. Before, I thought that the top of the pipe happens before everything, as in your first snippet. Interesting note - I found that the query value differences (and the timestamps themselves) are always divisible by 1024. I think queries can be written with a certain granularity on my GPU, which would explain the 1024 and 2048 in the timings. Can you confirm that? – nikitablack Jul 29 '20 at 07:13