I'm having an unusual problem while working on an OpenGL ES 3.0 project on Android. Essentially, I need frame data in single-channel grayscale format for some computer vision work. I'm using a custom shader, an FBO, and PBOs to get this done.

The flow of the program is as follows.

  1. bind the generated FBO
  2. draw() to the FBO
  3. bind PBO and glReadPixels()
  4. bind PBO from previous frame and glMapBufferRange()
  5. process the provided pixel data from glMapBufferRange()
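
The bookkeeping in steps 3 and 4 can be sketched without any GL calls. This is a minimal simulation, assuming two PBO slots and a hypothetical `PingPongDemo` class (not part of the project); it only tracks which frame's pixels each slot would hold, to show that the buffer mapped in frame N carries the pixels of frame N-1 and that the very first map has nothing valid to read.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; it performs no GL work.
final class PingPongDemo {
    /** For each frame, returns the index of the frame whose pixels get mapped (-1 if none yet). */
    static int[] simulate(int frames) {
        int[] slotContents = {-1, -1}; // which frame's pixels each PBO slot holds
        int current = 0;
        int previous = 1;
        int[] mapped = new int[frames];
        for (int frame = 0; frame < frames; frame++) {
            slotContents[current] = frame;          // step 3: glReadPixels into the current PBO
            int tmp = previous;                     // swap the two slot indices
            previous = current;
            current = tmp;
            mapped[frame] = slotContents[current];  // step 4: map the other PBO
        }
        return mapped;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate(3))); // prints [-1, 0, 1]
    }
}
```

The `-1` in the first position is the case to guard against in real code: on the first frame the second PBO has never been written.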

I'd like to confirm that this process is correct, and to know whether anything can be done to improve performance. The relevant code is posted below so it is easy to follow.

The PBO generator code

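    // Two PBOs so one can be filled by glReadPixels while the other is mapped.
    final int[] pbuffers = new int[2];

    GLES30.glGenBuffers(2, pbuffers, 0);

    for (int i = 0; i < pbuffers.length; i++) {
        GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, pbuffers[i]);
        // One byte per pixel (GL_RED + GL_UNSIGNED_BYTE); storage is allocated but not initialised.
        GLES30.glBufferData(GLES30.GL_PIXEL_PACK_BUFFER, width * height, null, GLES30.GL_DYNAMIC_READ);
        GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, 0);
    }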

    pbo_id[PBO_PRIMARY_ID] = pbuffers[0];
    pbo_id[PBO_SECONDARY_ID] = pbuffers[1];
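
The `width * height` size passed to glBufferData is only correct when each row needs no padding. With the default GL_PACK_ALIGNMENT of 4, rows written by glReadPixels are padded up to a multiple of 4 bytes. A minimal sketch of that arithmetic, using a hypothetical `PackedSize` helper (the names are assumptions, not part of the project):

```java
// Hypothetical helper for illustration only; it performs no GL work.
final class PackedSize {
    /** Bytes per row after rounding up to the pack alignment (1, 2, 4, or 8). */
    static int rowStride(int width, int bytesPerPixel, int packAlignment) {
        int unpadded = width * bytesPerPixel;
        return ((unpadded + packAlignment - 1) / packAlignment) * packAlignment;
    }

    /** Total bytes glReadPixels would write for a width x height region. */
    static int bufferSize(int width, int height, int bytesPerPixel, int packAlignment) {
        return rowStride(width, bytesPerPixel, packAlignment) * height;
    }

    public static void main(String[] args) {
        System.out.println(bufferSize(480, 360, 1, 4)); // prints 172800
        System.out.println(bufferSize(482, 360, 1, 4)); // prints 174240
    }
}
```

For the 480 x 360 single-channel case used here, 480 is already a multiple of 4, so the result equals `width * height`. A width such as 482 would need a larger buffer, or a pack alignment of 1.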

Step 3 from the list -> bind PBO and glReadPixels()

    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, pbo_id[currentBuffer]);
    GLES30.glReadBuffer(GLES30.GL_COLOR_ATTACHMENT0);

    JNI.glReadPixels(0, 0, width, height, GL_RED, GL_UNSIGNED_BYTE, 0);

    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, 0);

    final int prevBuffer = previousBuffer;

    previousBuffer = currentBuffer;
    currentBuffer = prevBuffer;

Step 4 from the list -> bind PBO from previous frame and glMapBufferRange(). This is the PBO that glReadPixels wrote into during the previous frame.

    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, pbo_id[currentBuffer]);

    JNI.glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height, GL_MAP_READ_BIT);

    GLES30.glUnmapBuffer(GLES30.GL_PIXEL_PACK_BUFFER);
    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, 0);
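
Step 5 is not shown in the snippet above. The mapped pointer is only valid between glMapBufferRange and glUnmapBuffer, so the pixels have to be read or copied in that window. A minimal sketch, assuming the JNI wrapper returns a direct `java.nio.ByteBuffer` (an assumption about the wrapper) and using a hypothetical `MappedCopy` helper; a plain direct buffer stands in for the mapped PBO here:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper for illustration only; the direct buffer stands in for a mapped PBO.
final class MappedCopy {
    /** Copies the first length bytes of mapped into dst, leaving mapped's position untouched. */
    static void copyOut(ByteBuffer mapped, byte[] dst, int length) {
        ByteBuffer view = mapped.duplicate(); // shares content, independent position and limit
        view.position(0);
        view.limit(length);
        view.get(dst, 0, length);             // one bulk copy into ordinary heap memory
    }

    public static void main(String[] args) {
        ByteBuffer fake = ByteBuffer.allocateDirect(4);
        fake.put(new byte[] {10, 20, 30, 40});
        byte[] pixels = new byte[4];
        copyOut(fake, pixels, pixels.length);
        System.out.println(Arrays.toString(pixels)); // prints [10, 20, 30, 40]
    }
}
```

Processing would then run on the heap array after the buffer has been unmapped, rather than on the mapped memory itself.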

This is where the performance problem comes from. I'm currently reading back 480 x 360 single-channel grayscale pixels (computed by a shader). I've run some benchmarks; the results are below.

40-50ms -> JNI.glReadPixels(0, 0, width, height, GL_RED, GL_UNSIGNED_BYTE, 0);
0-1ms -> JNI.glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height, GL_MAP_READ_BIT);

My understanding is that glReadPixels into a PBO is not meant to be a blocking call, but for whatever reason it blocks here (and performs far worse than reading straight from the FBO without a PBO). glMapBufferRange seems to behave as expected and returns the required data properly.

The only thing I can think of is that I'm using GL_RED and reading back only a single channel, but this still doesn't explain why glReadPixels is blocking.

Devices I've used for benchmarking (the behaviour is consistent across them).

  1. HTC One M8s (40-50ms)
  2. Nexus 5x (20-30ms)
  3. Google Pixel (15-30ms)

Any help with this would be highly appreciated! In the meantime, I'm going to experiment a bit more to see if there is anything obvious that I've missed.

EDIT -> 16/03/2017 (Added more code for clarity)

FBO Setup Code

    final int[] values = new int[1];
    GLES30.glGenTextures(1, values, 0);
    GLES30.glBindTexture(GLES30.GL_TEXTURE_2D, values[0]);

    // we only want GRAYSCALE / Single channel texture
    GLES30.glTexImage2D(GLES30.GL_TEXTURE_2D, 0, GLES30.GL_R8, texWidth, texHeight, 0, GLES30.GL_RED, GLES30.GL_UNSIGNED_BYTE, null);
    GLES30.glTexParameteri(GLES30.GL_TEXTURE_2D, GLES30.GL_TEXTURE_WRAP_S, GLES30.GL_CLAMP_TO_EDGE);
    GLES30.glTexParameteri(GLES30.GL_TEXTURE_2D, GLES30.GL_TEXTURE_WRAP_T, GLES30.GL_CLAMP_TO_EDGE);
    GLES30.glTexParameteri(GLES30.GL_TEXTURE_2D, GLES30.GL_TEXTURE_MIN_FILTER, GLES30.GL_NEAREST);
    GLES30.glTexParameteri(GLES30.GL_TEXTURE_2D, GLES30.GL_TEXTURE_MAG_FILTER, GLES30.GL_NEAREST);

    this.tex_id[0] = values[0];

    GLES30.glGenFramebuffers(1, values, 0);
    GLES30.glBindFramebuffer(GLES30.GL_FRAMEBUFFER, values[0]);

    this.fbo_id[0] = values[0];
    GLES30.glFramebufferTexture2D(GLES30.GL_FRAMEBUFFER, GLES30.GL_COLOR_ATTACHMENT0, GLES30.GL_TEXTURE_2D, this.tex_id[0], 0);

    final int status = GLES30.glCheckFramebufferStatus(GLES30.GL_FRAMEBUFFER);

    if (status != GLES30.GL_FRAMEBUFFER_COMPLETE) {
        Debug.LogError("Framebuffer incomplete. Status: " + status);
    }

    GLES30.glBindFramebuffer(GLES30.GL_FRAMEBUFFER, 0);

The full render code. I've deconstructed as much of the logic and flow as possible for clarity.

    // bind the offscreen FBO and render the current camera frame
    GLES30.glBindFramebuffer(GLES30.GL_FRAMEBUFFER, dualFBO.getID());
    camera.draw(ShaderType.GRAYSCALE);

    // ping-pong the FBO ID's
    dualFBO.swap();

    // dualFBO will now return the ID for last frame
    GLES30.glBindFramebuffer(GLES30.GL_FRAMEBUFFER, dualFBO.getID());

    // bind the current PB and submit (meant to be async) glReadPixels
    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, dualPBO.getID());
    GLES30.glReadBuffer(GLES30.GL_COLOR_ATTACHMENT0);

    // this call locks for 30-50ms... why? (meant to be async???)
    JNI.glReadPixels(0, 0, width, height, GL_RED, GL_UNSIGNED_BYTE, 0);

    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, 0);

    // ping-pong the PBO ID's.
    dualPBO.swap();

    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, dualPBO.getID());

    // this call is instant
    JNI.glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height, GL_MAP_READ_BIT);

    GLES30.glUnmapBuffer(GLES30.GL_PIXEL_PACK_BUFFER);
    GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, 0);
  • It might be worth trying to triple buffer instead of double buffer your PBOs to see if that helps. – Columbo Mar 15 '17 at 06:27
  • Thank you Columbo. I've already tried triple buffering, but there was no visible performance gain. Triple buffering also means that the required grayscale frame is yet another frame behind, which is not ideal. – David Arayan Mar 15 '17 at 06:49
  • Not related to the slowdown, but on many mobile implementations it is likely that the graphics memory isn't cached on the CPU, so running CV algorithms directly on the mapped buffer is going to be horrifically slow. – solidpixel Mar 15 '17 at 11:07
  • It would be useful to get a complete reproducer here - hard to tell what you are doing from the snippets here. Are you also triple buffering the color attachments you are rendering into? If you are not then you might be serializing the GPU processing and the CPU readbacks. – solidpixel Mar 15 '17 at 11:12
  • Thank you for the comments solidpixel, I've added more code for clarity. – David Arayan Mar 15 '17 at 23:40
  • Any success here? I'm trying to do the exact same thing. – Crearo Rotar May 31 '18 at 09:42
  • G'day Crearo, unfortunately this remained unsolved. What we found is that the speed is incredibly inconsistent, not only between different Android versions but also between different devices; e.g. the Google Pixel is quite slow with the operation, whereas the Samsung Galaxy is quite fast. Since we were using this for Augmented Reality applications, we now use ARCore, which provides faster CPU access to the camera image than a GL readback. I hope that helps, and good luck! – David Arayan Jun 03 '18 at 03:21
  • Additionally, we found that glReadPixels blocks because the GPU has not actually finished a full copy by the time the call is made. Depending on load, the full copy takes close to 3-4 frames to complete, in which case glReadPixels actually completes in close to 0-1ms. I hope that is also useful information. – David Arayan Jun 03 '18 at 03:24
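
One way to act on the 3-4 frame observation in the last comment is to widen the two-slot ping-pong into a ring, so that each slot is only touched again several frames after it was used. This is a minimal sketch of the index bookkeeping only, using a hypothetical `ReadbackRing` class (not from the original project) with no GL calls. Whether a given depth avoids the stall would still need measuring per device, and a sync object per slot (for example via glFenceSync) would be a more robust readiness check than counting frames.

```java
// Hypothetical helper for illustration only; it performs no GL work.
final class ReadbackRing {
    private final int slots;
    private long frame = 0;

    ReadbackRing(int slots) {
        if (slots < 2) {
            throw new IllegalArgumentException("need at least 2 slots");
        }
        this.slots = slots;
    }

    /** Slot that should receive this frame's glReadPixels. */
    int writeSlot() {
        return (int) (frame % slots);
    }

    /** Oldest slot, written (slots - 1) frames ago; -1 until enough frames have passed. */
    int readSlot() {
        return frame < slots - 1 ? -1 : (int) ((frame + 1) % slots);
    }

    /** Call once at the end of every frame. */
    void advance() {
        frame++;
    }

    public static void main(String[] args) {
        ReadbackRing ring = new ReadbackRing(4);
        for (int i = 0; i < 6; i++) {
            System.out.println("write " + ring.writeSlot() + ", read " + ring.readSlot());
            ring.advance();
        }
    }
}
```

With four slots the mapped data is three frames old, which is the latency trade-off already raised in the triple-buffering comment above.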
