
I have written a program that performs some calculations in a compute shader and then displays the returned data. This works perfectly, except that program execution is blocked while the shader is running (see code below), and depending on the parameters this can take a while:

void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GLfloat* mapped = (GLfloat*)(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}

int main()
{
    // Initialization stuff
    // ...

    while (glfwWindowShouldClose(Window) == 0)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glfwPollEvents();
        glfwSwapInterval(2); // Doesn't matter what I put here

        CalculateSomething(Result);
        Render(Result);

        glfwSwapBuffers(Window);
    }
}

To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:

void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

bool GPU_busy()
{
    if (GPU_sync == nullptr)    // no computation in flight
        return false;

    GLint GPU_status;
    glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
    return GPU_status == GL_UNSIGNALED;
}

These two functions are part of a class, and posting all of it here would get a little messy and complicated (if more code is needed, tell me). In short: every loop iteration, when the class is told to do the computation, it first checks whether the GPU is busy. If the previous computation is done, the result is copied to CPU memory (or a new calculation is started); otherwise it returns to main without doing anything else. This approach works in that it produces the right result, but my main loop is still blocked.
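To make the flow clearer, this is roughly how the polling is wired into the main loop (a simplified sketch; CollectResult is a placeholder name for the glMapBuffer/memcpy/glUnmapBuffer block from the first version of CalculateSomething):

while (glfwWindowShouldClose(Window) == 0)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glfwPollEvents();

    if (!GPU_busy())
    {
        if (GPU_sync != nullptr)       // a computation just finished
        {
            CollectResult(Result);     // placeholder: glMapBuffer + memcpy + glUnmapBuffer
            glDeleteSync(GPU_sync);    // don't leak the fence
            GPU_sync = nullptr;
        }
        CalculateSomething(Result);    // start the next computation
    }

    Render(Result);                    // renders the last finished result
    glfwSwapBuffers(Window);
}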

Doing some timing revealed that CalculateSomething, Render (and everything else) run fast, as I would expect them to. But now glfwSwapBuffers takes >3000 ms (depending on how long the calculations of the compute shader take).

Shouldn't it be possible to swap buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader is not done yet, the old result gets rendered). Or am I missing something here (do queued OpenGL calls get processed before glfwSwapBuffers does anything)?

Edit:

I'm not sure why this question got closed or what additional information is needed (maybe other than the OS, which would be Windows). As for "desired behavior": well, I'd like the glfwSwapBuffers call not to block my main loop. For additional information, please ask...

As pointed out by Erdal Küçük, an implicit call of glFlush might cause latency. I put this call before glfwSwapBuffers for testing purposes and timed it - no latency here...

I'm sure I can't be the only one who ever ran into this problem. Maybe someone could try to reproduce it? Simply put a compute shader in the main loop that takes a few seconds to do its calculations (something like the sketch below). I have read somewhere that similar problems occur especially when calling glMapBuffer. This seems to be an issue with the GPU driver (mine would be an integrated Intel GPU). But nowhere have I read about latencies above 200 ms...
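For reproduction, a shader along these lines should be slow enough (a hypothetical time-burner, given here as a C string; the iteration count would need tuning):

// Hypothetical reproduction shader: an artificial busy loop that takes
// seconds per dispatch (tune the iteration count for your GPU).
const char* SlowComputeSrc = R"(
    #version 430
    layout(local_size_x = 1) in;
    layout(std430, binding = 0) buffer Out { float Result[]; };
    void main()
    {
        float acc = 0.0;
        for (int i = 0; i < 100000000; ++i)
            acc += sin(float(i));
        Result[gl_GlobalInvocationID.x] = acc;
    }
)";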

  • It looks like you are using glfwSwapInterval() to control the frame rate of your application. However, this function only affects the rate at which the buffers are swapped, not the rate at which your application runs. In other words, your application will continue to run as fast as it can, regardless of the value you pass to glfwSwapInterval(). – kppro Dec 04 '22 at 13:01
  • To fix this issue, you can use a different mechanism to control the frame rate of your application. One way to do this is to use a timer to measure the time elapsed since the last frame and then use this information to decide whether to render a new frame or not. – kppro Dec 04 '22 at 13:01
  • Depending on your OS, implicit flush could occur, e.g. `eglSwapBuffers`, `glxSwapBuffers` will implicitly call `glFlush`. Although `glFlush` does not wait for command completion, it still has to wait until all commands have been accepted by the GPU, which could cause latency (see: https://www.khronos.org/opengl/wiki/Swap_Interval). You could decouple the computing stage from the surface via a separate context (shared context, unfortunately only via a second window in glfw; see the sketch after these comments), maybe it helps. – Erdal Küçük Dec 06 '22 at 01:31
  • @ErdalKüçük is correct, by accessing the glfw3 [source code](https://github.com/glfw/glfw/blob/master/src/glx_context.c#L187), the `glfwSwapBuffers` implicitly calls `glxSwapBuffers`. In the [khronos glxSwapBuffers page](https://registry.khronos.org/OpenGL-Refpages/gl2.1/xhtml/glXSwapBuffers.xml) they tell about the problem you're facing in the *Notes* section. You must solve this yourself, as it seems, by executing _glFinish_ and using semaphores as written in that section. Hope it helps. Also, try reading the glfw doc, it's worth it, might give you some ideas on how to proceed. – Carl HR Feb 26 '23 at 03:55
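A rough (untested) sketch of the shared-context idea from the comments; the second, hidden window exists only to obtain a second GL context for a worker thread:

#include <thread>

// Untested sketch: create a hidden window whose context shares objects with
// the main one, then run the compute work from a worker thread while the
// main thread keeps rendering and swapping.
glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);
GLFWwindow* ComputeWindow = glfwCreateWindow(1, 1, "", nullptr, Window);

std::thread ComputeThread([&]()
{
    glfwMakeContextCurrent(ComputeWindow);  // compute context, this thread only
    // ...load uniforms, glDispatchCompute, fence/map as before...
});
// (join or detach ComputeThread before shutdown)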

1 Answer


I solved a similar issue with GL_PIXEL_PACK_BUFFER, effectively used as an offscreen compute shader. The approach with fences is correct, but you then need a separate function that checks the status of the fence using glGetSynciv to read the GL_SYNC_STATUS. The solution (admittedly in Java) can be found here.

An explanation for why this is necessary can be found in @Nick Clark's comment:

Every call in OpenGL is asynchronous, except for the frame buffer swap, which stalls the calling thread until all submitted functions have been executed. Thus, the reason why glfwSwapBuffers seems to take so long.

The relevant portion from the solution is:

public void finishHMRead( int pboIndex ){
    int[] length = new int[1];
    int[] status = new int[1];
    GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
    int signalStatus = status[0];
    int glSignaled   = GLES30.GL_SIGNALED;
    if( signalStatus == glSignaled ){
        // Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
        ByteBuffer pixelBuffer;
        texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );

        // map data to a bytebuffer
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
        pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );
        
        // Copy to the long term ByteBuffer
        pixelBuffer.rewind(); //copy from the beginning
        texLayerByteBuffers[ pboIndex ].put( pixelBuffer );
        
        // Unmap and unbind the currently bound pixel buffer
        GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
        Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
        acknowledgeHMReadComplete();
    } else {
        // If it wasn't done, resubmit for another check in the next render update cycle
        RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
        UpdateList.getRef().addRenderUpdate( finishHmRead );
    }
}

Basically, fire off the compute shader, then wait for the glGetSynciv check of GL_SYNC_STATUS to return GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
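Translated back to the question's C-style setup, the per-frame check might look roughly like this (a sketch only; GPU_sync, Result, X and Y are from the question, and SSBO stands in for whatever buffer object the question actually binds):

// Sketch of the same polling readback for the question's SSBO setup.
// Returns true once the result has been copied; call once per frame.
bool TryCollectResult(GLfloat* Result)
{
    GLint status = GL_UNSIGNALED;
    glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &status);
    if (status != GL_SIGNALED)
        return false;                    // still computing; keep rendering old data

    glBindBuffer(GL_SHADER_STORAGE_BUFFER, SSBO);  // SSBO: assumed buffer handle
    GLfloat* mapped = (GLfloat*)glMapBufferRange(
        GL_SHADER_STORAGE_BUFFER, 0, sizeof(GLfloat) * X * Y, GL_MAP_READ_BIT);
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);

    glDeleteSync(GPU_sync);
    GPU_sync = nullptr;
    return true;
}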

  • Thank you for the answer. My Java is a bit rusty, though. But if I understand your approach correctly, isn't that exactly what I tried with my GPU_busy function? I did the glMapBuffer operation when the GPU wasn't busy anymore. Admittedly I don't remember the details, because I gave up this approach and don't have the code anymore, as no answer was given here and I could not get it working otherwise. But if I remember correctly it did work and gave correct results - but still the main loop was blocked. I will try again tomorrow and report back the result, though. – Paul Aner Feb 20 '23 at 22:53
  • Apologies, missed the effect of `glGetSynciv` since it didn't look like anything was being done with the result, such as a callback to `glfwSwapBuffers` whenever your `GPU_busy()` would return true. Think the general approach is correct though. Was almost the exact issue I had with `glMapBuffer` and `GL_PIXEL_PACK_BUFFER`. Suggest making the call to `glfwSwapBuffers` be a callback that only executes if `GPU_status == GL_SIGNALED` – G. Putnam Feb 21 '23 at 01:01
  • OK, I am about to try this now, but I am afraid that this was the issue back then: when I make a callback to `glfwSwapBuffers`, the screen would only be updated AFTER the compute shader is done (right?). So the effect would be the same: the main loop/screen update is stalled until the compute shader completes? Could it really be that a sensible refresh rate can only be achieved by opening a second context (I would not like a second window) or by splitting the computations for the shader into smaller snippets? – Paul Aner Feb 21 '23 at 09:15
  • I'm not sure how your screen would be updated Before the shader completes? Most of the calculations are asynchronous, but ultimately they're going to be bound by how fast your compute shader can actually calculate frames. If the frame's not done, there's nothing to swap. Might be worth timing how long the compute shader itself is taking. Part of the issue I think you're running into is how OpenGL timing in `CalculateSomething` works. It doesn't actually report the Total shader time, only the time to Start the job. Need a timing trigger when it finishes (see the timer-query sketch after these comments). Note, calculations are asynchronous. – G. Putnam Feb 21 '23 at 16:49
  • I also recommend looking at [this thread](https://computergraphics.stackexchange.com/questions/9956/performance-of-compute-shaders-vs-fragment-shaders-for-deferred-rendering) on compute shaders vs fragment shaders. Although compute shaders CAN be faster than fragment shaders, you really have to know what you're doing, and in many cases you'll more easily get a fast frame rate using the already vendor-optimized fragment shader pixel routines and a framebuffer. – G. Putnam Feb 21 '23 at 17:09
  • I actually read your link a couple of weeks ago. And if you scroll down, the conclusion is pretty much that a compute shader can (/ often will) be MUCH faster. I tried a FS once (although with a simpler approach, but my guess would be that in my case a CS will ALWAYS be faster) and it was slower. However: I think when timing a CS it is sufficient if a call to `glMapBuffer` is done - this has to wait until the CS is finished anyway. – Paul Aner Feb 21 '23 at 17:34
  • As to what to display when the CS is not done: Simply display the old (not yet updated) data in another array. The main problem is that when the main-loop is blocked, ImGUI doesn't update and a change of parameters is basically impossible (at least it's a real pain). – Paul Aner Feb 21 '23 at 17:36
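For the timing point above, a GL timer query around the dispatch would measure the GPU-side time. A minimal sketch (note the final read itself blocks until the result is available, so it belongs after the fence is signaled):

// Minimal sketch: timing the dispatch itself with a GL timer query.
GLuint TimeQuery;
glGenQueries(1, &TimeQuery);

glBeginQuery(GL_TIME_ELAPSED, TimeQuery);
glDispatchCompute(X, Y, 1);
glEndQuery(GL_TIME_ELAPSED);

// Later, e.g. once the fence is signaled, read the elapsed GPU time.
// This call blocks if the result is not yet available.
GLuint64 ElapsedNs = 0;
glGetQueryObjectui64v(TimeQuery, GL_QUERY_RESULT, &ElapsedNs);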