I made an in-game graphical profiler (CPU and GPU) and there is one strange behavior with the Nvidia driver that I'm not sure how to handle.
Here is a screenshot of what a normal case looks like:
What you can see here is 3 consecutive frames, GPU at the top, CPU at the bottom. Both graphs are synchronized.
The "END FRAME" bar only contains the call to SwapBuffers
. It can seem weird that it's blocking until the GPU has done all its work, but that's what the driver chooses to do sometimes when vsync is ON and that all the work (CPU and GPU) can fit in 16ms (AMD does the same). My guess is that it does it to minimize inputs lag.
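For reference, the "END FRAME" bar is just a CPU timer around the swap call, roughly like this (a minimal sketch assuming a WGL context; the actual profiler markers are omitted):

```cpp
#include <windows.h>
#include <chrono>

// What the "END FRAME" bar measures: wall-clock time spent in SwapBuffers.
double EndFrameMilliseconds(HDC hdc)
{
    auto t0 = std::chrono::steady_clock::now();
    SwapBuffers(hdc);  // with vsync on, the driver may block here until the GPU has finished
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```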
Now my problem is that it does not always do that. Depending on what happens in the frame, the graph sometimes looks like this:
What actually happens here is that the first OpenGL call blocks, instead of the call to SwapBuffers. In this particular case, the blocking call is glBufferData. It's much more visible if I add some dummy code that does just that (create a uniform buffer, load it with random values and destroy it):
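The dummy code is essentially this (a minimal sketch; the buffer size and the random fill are arbitrary, and the loader header is whatever the engine already uses):

```cpp
#include <GL/glew.h>
#include <cstdlib>
#include <vector>

// Dummy workload: create a uniform buffer, fill it with random values,
// destroy it. On its own this should be nearly free, yet glBufferData is
// where the driver chooses to block.
void DummyUniformBufferUpload()
{
    const size_t count = 1024;  // arbitrary size
    std::vector<float> data(count);
    for (float& v : data)
        v = static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX);

    GLuint ubo = 0;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, data.size() * sizeof(float),
                 data.data(), GL_DYNAMIC_DRAW);  // this is the call that ends up blocking
    glBindBuffer(GL_UNIFORM_BUFFER, 0);
    glDeleteBuffers(1, &ubo);
}
```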
This is a problem because it means a bar in the graph may get very big for no apparent reason. People seeing that will probably draw the incorrect conclusion that some code is slow.
So my question is: how can I handle this case? I need a way to display meaningful CPU timings at all times.
Adding dummy code that loads a uniform buffer is not very elegant and may not work with future versions of the driver (what if the driver only blocks on draw calls instead?).
Synchronizing with glClientWaitSync does not look like a good solution either: if the frame rate drops, the driver will stop blocking so that the CPU and GPU frames can run in parallel, and I would need to detect that in order to stop calling glClientWaitSync (but I'm not sure how to do that).
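To be concrete, the glClientWaitSync approach would look roughly like this (a sketch of the idea, not code I actually have; the function names and the dedicated "SYNC" bar are hypothetical):

```cpp
#include <GL/glew.h>

static GLsync g_frameFence = nullptr;

// Called right after submitting a frame's GL commands.
void InsertFrameFence()
{
    if (g_frameFence)
        glDeleteSync(g_frameFence);
    g_frameFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Called at a fixed point in the next frame, under a dedicated "SYNC"
// profiler bar, so the stall has an obvious home in the graph. The catch:
// when the frame rate drops, this forces a CPU/GPU serialization that the
// driver would otherwise have avoided.
void WaitForPreviousFrame()
{
    if (!g_frameFence)
        return;
    glClientWaitSync(g_frameFence, GL_SYNC_FLUSH_COMMANDS_BIT,
                     1000000000ull);  // timeout in nanoseconds (1 second)
}
```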
(Suggestions for a better title are welcome.)
Edit: here is what happens without vsync, when the GPU is the bottleneck:
The GPU frame takes longer than the CPU frame, so the driver decided to block the CPU during glBufferData until the GPU has caught up.
The conditions are not the same, but the problem is the same: the CPU timings are "wrong" because the driver makes some of the OpenGL functions block. That may actually be a simpler example to understand than the one with vsync on.