Performance issue with glDrawArraysInstanced

Question

I'm trying to implement an OpenGL 4 instanced drawing algorithm in which each instance consists of a single triangle. The main reasons I want to implement this kind of algorithm are:

the ability to use less GPU memory in the frequent scenario where colors are given on a per-triangle basis and not on a per vertex basis
the ability to perform per-triangle computations without using geometry shaders which, from my experiments, dramatically slow down the whole pipeline
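
As a rough check on the first motivation, the sketch below compares the per-triangle memory cost of a conventional per-vertex layout with the instanced layout described in this question. It assumes 32-bit floats, tightly packed attributes, and no index buffer; the function names are illustrative and do not come from the application.

```cpp
#include <cstddef>

static_assert(sizeof(float) == 4, "the figures below assume 32-bit floats");

// Bytes per triangle when every vertex stores its own position and color
// (non-instanced GL_TRIANGLES without an index buffer).
constexpr std::size_t perVertexBytes() {
    return 3 * (3 + 4) * sizeof(float);  // 3 vertices * (xyz + rgba)
}

// Bytes per triangle with the instanced layout from the question:
// three positions plus one color shared by the whole triangle.
constexpr std::size_t perTriangleBytes() {
    return (3 * 3 + 4) * sizeof(float);  // 9 position floats + rgba
}

// Total buffer size for a given triangle count.
constexpr std::size_t totalBytes(std::size_t bytesPerTriangle,
                                 std::size_t numTriangles) {
    return bytesPerTriangle * numTriangles;
}
```

Under these assumptions a triangle costs 84 bytes with per-vertex colors and 52 bytes with a single per-triangle color, so 800000 triangles need about 67.2 MB versus 41.6 MB.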

My rendering program consists of a vertex shader and a fragment shader. The vertex shader is as follows:

#version 400 core

layout (location = 0) in vec3 tri_p0;
layout (location = 1) in vec3 tri_p1;
layout (location = 2) in vec3 tri_p2;
layout (location = 3) in vec4 tri_colorP0;
layout (location = 4) in vec4 tri_colorP1;
layout (location = 5) in vec4 tri_colorP2;

out FRAGMENT {
    vec4 color;
} vs_out;

uniform mat4 mvp_matrix;

void main(void) {
    vec3 position;
    vec4 color;

    if(gl_VertexID == 0) {
        position = tri_p0;
        color = tri_colorP0;
    }
    else if(gl_VertexID == 1) {
        position = tri_p1;
        color = tri_colorP1;
    }
    else if(gl_VertexID == 2) {
        position = tri_p2;
        color = tri_colorP2;
    }

    vs_out.color = color;

    gl_Position = mvp_matrix * vec4(position, 1.0);
}

The fragment shader is instead this one:

#version 400 core

layout (location = 0) out vec4 color;

in FRAGMENT {
    vec4 color;
} fs_in;

void main(void) {
    color = fs_in.color;
}

As you can see, in my vertex shader I declare three vertex attributes for the vertex positions and three vertex attributes for the colors. All these attributes are instanced and their divisor is set to 1.

The reason why I have three color attributes is that sometimes I want to be able to have different colors for the three triangle vertices while, more often, I have a single color for the whole triangle. In this last scenario, I simply attach the three color attributes to the same VBO specifying the same stride and offset.

I wrote a test application that draws a grid of quads, each made of two triangles. This is the code I used to initialize the vertex data:

int numQuadsPerRowCol = sqrtl(NUM_TRIANGLES / 2);
numTris = numQuadsPerRowCol * numQuadsPerRowCol * 2;

float stepX = (maxX - minX) / numQuadsPerRowCol;
float stepY = (maxY - minY) / numQuadsPerRowCol;

GLfloat* positions = new GLfloat[3 * 3 * numTris];
GLfloat* colors = new GLfloat[4 * numTris];

int k = 0;
int l = 0;

for (int i = 0; i < numQuadsPerRowCol; i++) {
    for (int j = 0; j < numQuadsPerRowCol; j++) {
        GLfloat color[4];

        int id = i * numQuadsPerRowCol + j;

        color[0] = ((id & 0x00ff0000) >> 16) / 255.0;
        color[1] = ((id & 0x0000ff00) >> 8) / 255.0;
        color[2] = (id & 0x000000ff) / 255.0;
        color[3] = 1.0;

        for (int t = 0; t < 2; t++) {
            for (int c = 0; c < 4; c++) {
                colors[l + c] = color[c];
            }
            l += 4;
        }

        GLfloat xLeft = minX + j * stepX;
        GLfloat xRight = minX + (j + 1) * stepX;
        GLfloat yBottom = minY + i * stepY;
        GLfloat yTop = minY + (i + 1) * stepY;

        //first triangle positions
        positions[k++] = xLeft;
        positions[k++] = yTop;
        positions[k++] = 0;

        positions[k++] = xLeft;
        positions[k++] = yBottom;
        positions[k++] = 0;

        positions[k++] = xRight;
        positions[k++] = yBottom;
        positions[k++] = 0;

        //second triangle positions
        positions[k++] = xLeft;
        positions[k++] = yTop;
        positions[k++] = 0;

        positions[k++] = xRight;
        positions[k++] = yBottom;
        positions[k++] = 0;

        positions[k++] = xRight;
        positions[k++] = yTop;
        positions[k++] = 0;
    }
}

glGenBuffers(1, &positionVbo);
glBindBuffer(GL_ARRAY_BUFFER, positionVbo);
glBufferData(GL_ARRAY_BUFFER, numTris * 3 * 3 * sizeof(float), positions, GL_STATIC_DRAW);

glVertexAttribPointer(TRI_P0, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), NULL);
glVertexAttribDivisor(TRI_P0, 1);
glEnableVertexAttribArray(TRI_P0);

glVertexAttribPointer(TRI_P1, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), (void *)(3 * sizeof(GLfloat)));
glVertexAttribDivisor(TRI_P1, 1);
glEnableVertexAttribArray(TRI_P1);

glVertexAttribPointer(TRI_P2, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), (void *)(6 * sizeof(GLfloat)));
glVertexAttribDivisor(TRI_P2, 1);
glEnableVertexAttribArray(TRI_P2);

glGenBuffers(1, &colorVbo);
glBindBuffer(GL_ARRAY_BUFFER, colorVbo);
glBufferData(GL_ARRAY_BUFFER, numTris * 4 * sizeof(float), colors, GL_STATIC_DRAW);

//All color attributes are attached to the same VBO with the same stride and offset --> per-triangle colors
glVertexAttribPointer(TRI_COLOR_P0, 4, GL_FLOAT, GL_FALSE, 0, NULL);
glVertexAttribDivisor(TRI_COLOR_P0, 1);
glEnableVertexAttribArray(TRI_COLOR_P0);

glVertexAttribPointer(TRI_COLOR_P1, 4, GL_FLOAT, GL_FALSE, 0, NULL);
glVertexAttribDivisor(TRI_COLOR_P1, 1);
glEnableVertexAttribArray(TRI_COLOR_P1);

glVertexAttribPointer(TRI_COLOR_P2, 4, GL_FLOAT, GL_FALSE, 0, NULL);
glVertexAttribDivisor(TRI_COLOR_P2, 1);
glEnableVertexAttribArray(TRI_COLOR_P2);

glBindBuffer(GL_ARRAY_BUFFER, 0);

As you can see I use a single VBO for positions but each position attribute is connected to the VBO using a different offset.

For colors, I use a single VBO and all color attributes are connected using the same stride and offset (thus achieving per-triangle colors instead of per-vertex colors).
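
The setup code above only shows the per-triangle color case. The sketch below spells out the stride and offset of the three color attributes for both cases. The per-vertex layout (three RGBA values stored back to back for each triangle) is an assumption, and `AttribLayout` and `colorLayouts` are hypothetical names, not part of the application.

```cpp
#include <array>
#include <cstddef>

struct AttribLayout {
    std::size_t stride;  // byte stride for glVertexAttribPointer
    std::size_t offset;  // byte offset into the bound color VBO
};

// Layouts for TRI_COLOR_P0, TRI_COLOR_P1 and TRI_COLOR_P2, in that order.
// perVertexColors == false: one RGBA per triangle, all attributes alias it.
//   (An explicit stride of 16 bytes is equivalent to the stride of 0 used
//    in the question, because the data is tightly packed.)
// perVertexColors == true: three RGBA values per triangle (assumed layout).
inline std::array<AttribLayout, 3> colorLayouts(bool perVertexColors) {
    const std::size_t rgba = 4 * sizeof(float);
    if (!perVertexColors) {
        return {{ {rgba, 0}, {rgba, 0}, {rgba, 0} }};
    }
    return {{ {3 * rgba, 0}, {3 * rgba, rgba}, {3 * rgba, 2 * rgba} }};
}
```

Each entry would then be passed to `glVertexAttribPointer` as the stride and the pointer offset of the corresponding color attribute, with the divisor left at 1 in both cases.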

The rendering loop is as follows:

glUseProgram(render_program);

glUniformMatrix4fv(uniforms.mvp_matrix, 1, GL_FALSE, proj_matrix * view_matrix);

glDrawArraysInstanced(GL_TRIANGLES, 0, 3, numTris);

I tested the application on an integrated Intel HD 4400 GPU and on an Nvidia GeForce GT 750M. Surprisingly, performance is much better on the Intel GPU than on the Nvidia one. Here are some fps stats:

800000 triangles:

Intel: 140 fps
Nvidia: 31 fps

1600000 triangles:

Intel: 74 fps
Nvidia: 16 fps
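
To make the figures easier to compare, the small sketch below converts them into frame times and triangle throughput. The helper names are illustrative; the inputs are only the numbers listed above.

```cpp
// Frame time in milliseconds for a given frame rate.
inline double frameTimeMs(double fps) {
    return 1000.0 / fps;
}

// Triangles processed per second, in millions.
inline double millionTrisPerSecond(double numTriangles, double fps) {
    return numTriangles * fps / 1.0e6;
}
```

With the numbers above this gives roughly 112 and 118 million triangles per second on Intel, and roughly 25 and 26 million on Nvidia. Throughput stays nearly constant on each GPU when the triangle count doubles, which suggests the cost grows about linearly with the number of instances on both.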

Does anybody have any advice on how to improve performance on the Nvidia card? Do you think that using TBOs for positions and colors would give me a performance gain?
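
For reference, this is the kind of index arithmetic the TBO variant would need. The sketch is a CPU-side stand-in, not shader code: it assumes a non-instanced draw of 3 * numTris vertices in which the vertex shader derives the triangle and corner from gl_VertexID and fetches the data itself (with `texelFetch` on a buffer texture in GLSL). `FetchedVertex` and `fetchVertex` are hypothetical names, and whether this is actually faster on the Nvidia card is untested.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct FetchedVertex {
    std::array<float, 3> position;
    std::array<float, 4> color;
};

// CPU-side emulation of a vertex shader that pulls its data from buffers.
// positions: 9 floats per triangle (same layout as the position VBO above)
// colors:    4 floats per triangle (same layout as the color VBO above)
// vertexId:  plays the role of gl_VertexID in a draw of 3 * numTris vertices
inline FetchedVertex fetchVertex(const std::vector<float>& positions,
                                 const std::vector<float>& colors,
                                 std::size_t vertexId) {
    const std::size_t triangle = vertexId / 3;  // which triangle
    const std::size_t corner = vertexId % 3;    // which corner of that triangle
    const std::size_t p = triangle * 9 + corner * 3;
    const std::size_t c = triangle * 4;         // color is shared by all corners
    return FetchedVertex{
        {{positions[p], positions[p + 1], positions[p + 2]}},
        {{colors[c], colors[c + 1], colors[c + 2], colors[c + 3]}}
    };
}
```

The division and modulo replace the `if (gl_VertexID == ...)` chain of the instanced shader, and the per-triangle color is still stored only once.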

UPDATE:

To better understand the issue, I profiled the application under Windows using GPUView. I noticed quite different behavior between Intel and Nvidia.

Intel generates a single large DMA packet (8 kB) per frame, which executes quickly. Nvidia instead generates a much larger number of small packets (4-8 bytes) per frame; these queue up and therefore wait a long time before being executed.

This information made me wonder whether this might be an Nvidia driver bug. Do you think this is possible?

Though "triangle with position and color" sounds just like something the geo shader should work perfectly well with. That might very well be faster since it works "in one go". Problem with instancing is that it's not free, it requires some obscure tampering with the GPU for every instance. On the Intel HD, CPU = GPU, so that's no biggie, but on the nv card, well you see yourself. The usual recommendation is to use instancing with models that have at least several dozen (or around 100 or so) vertices. — Damon, Apr 08 '14 at 14:32
Hi Damon, thanks for your answer. Actually, though, I think this might be an Nvidia-specific issue. In fact, I also tried the application on an ATI card and the FPS, as I expected, was even better than on the Intel card. I also profiled the application using GPUView and I think I discovered something interesting. Please see the updates in the original post for the details. — l.moretto, Apr 08 '14 at 19:03
Re your update, it should actually not do either of these. From what you describe, Intel seems to simulate instancing in software by generating a large vertex buffer from the object, times the instances. Whereas nVidia seems to re-upload the triangle for every instance. What both _should_ be doing ideally is upload the object once, and draw it, merely adjusting some pointers on the GPU after each instance if the GPU can't already do that on its own (_actually_ you'd expect it should be able to do that already!). Instancing is, after all, merely something like a `for()` loop. — Damon, Apr 09 '14 at 11:23
Everything in instancing is the same all the time, except the instance ID. So really, this should work by doing one DMA transfer and telling the GPU "do this for 1,000 instances", or if the GPU isn't capable of that, telling it 1,000 times: "do this one", and increment the instance ID in between. — Damon, Apr 09 '14 at 11:27
Yes, I agree with you. The one you described is the exact behaviour I was expecting. I'm really surprised to see that the reality is quite different. — l.moretto, Apr 09 '14 at 12:05
@Damon: Telling it 1000 times: "Do this one" matches the observed behavior perfectly. — Ben Voigt, Jun 12 '14 at 21:55
