OpenGL: Geometry Shader performance with a lot of cubes

Question

I wrote a very simple OpenGL program that draws a 100x100x100 grid of points, each expanded into a cube by a geometry shader. The goal was to benchmark it against what I can currently do with DirectX 11.

With DirectX11, I can easily render these cubes at 60fps (vsync). However, with OpenGL I'm stuck at 40fps.
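
To put that gap in frame-time terms, here is a minimal sketch. The helper names are made up for illustration and are not part of either application:

```cpp
#include <cassert>

// Hypothetical helper: convert a frame rate into the duration of one frame.
double frameTimeMs(double fps)
{
    assert(fps > 0.0); // a zero or negative rate has no meaningful frame time
    return 1000.0 / fps;
}

// Extra milliseconds spent per frame at slowFps compared to fastFps.
double extraFrameTimeMs(double fastFps, double slowFps)
{
    return frameTimeMs(slowFps) - frameTimeMs(fastFps);
}
```

`frameTimeMs(60.0)` is about 16.67 ms and `frameTimeMs(40.0)` is 25 ms, so the OpenGL version spends roughly 8.3 ms more per frame. Because the DirectX figure is capped by vsync, 16.67 ms is only an upper bound on its frame time, and the real gap may be larger.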

In both applications, I am:

Using a point topology to represent just the position of the cube (stride = 12 bytes).
Only mapping to the Vertex Buffer in the initialise function, only ever once.
Using only two draw calls in total: one to render the cubes, one to render frametime.
Using back-face culling, and depth testing.
Limiting state changes to the minimum I need to draw the cubes (VBOs/shader program).

Here is my draw call:

    GLboolean CCubeApplication::Draw()
    {
        auto program = m_ppBatches[0]->GetShaders()->GetProgram(0);

        program->Bind();
        {
            glUniformMatrix4fv(program->GetUniform("g_uWVP"), 1, false, glm::value_ptr(m_matMatrices[MATRIX_WVP]));
            glDrawArrays(GL_POINTS, 0, m_uiTotal);
        }

        return true;
    }

The program->Bind() call above invokes glBindVertexArray and glUseProgram.

The rest is straightforward. My Update function only updates the camera's position and view matrix, and is identical in the DirectX and OpenGL versions.

My vertex shader is a pass-through, and my fragment shader returns a constant colour. This is my geometry shader:

#version 440 core

// GS_LAYOUT
layout(points) in;
layout(triangle_strip, max_vertices = 36) out;

// GS_IN
in vec4 vOut_pos[];

// GS_OUT

// UNIFORMS
uniform mat4 g_uWVP;
const float f = 0.1f;

const int elements[] = int[]
(
    0,2,1,
    2,3,1,

    1,3,5,
    3,7,5,

    5,7,4,
    7,6,4,

    4,6,0,
    6,2,0,

    3,2,7,
    2,6,7,

    5,4,1,
    4,0,1
);

// GS
void main()
{
    vec4 vertices[] = vec4[]
    (
        g_uWVP * (vOut_pos[0] + vec4(-f,-f,-f, 0)),
        g_uWVP * (vOut_pos[0] + vec4(-f,-f,+f, 0)), 
        g_uWVP * (vOut_pos[0] + vec4(-f,+f,-f, 0)), 
        g_uWVP * (vOut_pos[0] + vec4(-f,+f,+f, 0)), 
        g_uWVP * (vOut_pos[0] + vec4(+f,-f,-f, 0)), 
        g_uWVP * (vOut_pos[0] + vec4(+f,-f,+f, 0)), 
        g_uWVP * (vOut_pos[0] + vec4(+f,+f,-f, 0)), 
        g_uWVP * (vOut_pos[0] + vec4(+f,+f,+f, 0))
    );

    uint uiIndex = 0;
    for(uint uiTri = 0; uiTri < 12; ++uiTri)
    {
        for(uint uiVert = 0; uiVert < 3; ++uiVert)
        {
            gl_Position = vertices[elements[uiIndex++]];
            EmitVertex();
        }

        EndPrimitive();
    }
}
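
For a sense of scale, the per-frame workload this shader generates can be tallied with a small sketch. The constants come from the question (a 100x100x100 grid, 12 triangles of 3 vertices per cube, 8 transformed corners per cube); the function names are hypothetical:

```cpp
#include <cstdint>

// Figures taken from the question; names are illustrative only.
constexpr std::uint64_t kPointsPerAxis       = 100; // 100x100x100 grid
constexpr std::uint64_t kTrianglesPerCube    = 12;  // 6 faces, 2 triangles each
constexpr std::uint64_t kVerticesPerTriangle = 3;   // one strip per triangle
constexpr std::uint64_t kCornersPerCube      = 8;   // g_uWVP multiplies per point

// Number of input points passed to glDrawArrays.
constexpr std::uint64_t pointCount()
{
    return kPointsPerAxis * kPointsPerAxis * kPointsPerAxis;
}

// Vertices emitted by one geometry shader invocation (matches max_vertices).
constexpr std::uint64_t emittedVerticesPerPoint()
{
    return kTrianglesPerCube * kVerticesPerTriangle;
}

// Total vertices emitted by the geometry shader in one frame.
constexpr std::uint64_t emittedVerticesPerFrame()
{
    return pointCount() * emittedVerticesPerPoint();
}

// Total matrix-vector multiplies performed in the geometry shader per frame.
constexpr std::uint64_t matrixMultipliesPerFrame()
{
    return pointCount() * kCornersPerCube;
}
```

With these figures, one frame processes 1,000,000 points, emits 36,000,000 vertices (12,000,000 triangles before culling), and performs 8,000,000 matrix-vector multiplies in the geometry shader.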

I've seen people suggest instancing and other rendering methods, but I'm primarily interested in understanding why I can't get at least the same performance from OpenGL as I do from DirectX, given that the two versions seem virtually identical to me: identical data, identical shaders. Help?

UPDATE: I downloaded gDEBugger, and here is the OpenGL call log for one frame:

glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)

// Drawing cubes
glBindVertexArray(1)
glUseProgram(1)

glUniformMatrix4fv(0, 1, FALSE, {matrixData})

glDrawArrays(GL_POINTS, 0, 1000000)

// Drawing text
glBindVertexArray(2);
glUseProgram(5);

glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, 2);

glBindBuffer(GL_ARRAY_BUFFER, 2);
glBufferData(GL_ARRAY_BUFFER, 212992, {textData}, GL_DYNAMIC_DRAW);

glDrawArrays(GL_POINTS, 0, 34);

// Swap buffers
wglSwapBuffers();

Changing the bound Vertex Array Object is actually more expensive than using a single VAO and changing one or two vertex pointers between draw calls. However, the overhead is not substantial, you would have to be making hundreds of calls to that function to account for a 33% drop in performance. Also, since you are discussing a situation where your framerate is limited by VSYNC, you should probably measure performance in terms of milliseconds to finish your frame instead (timer queries will help). — Andon M. Coleman, Jun 08 '14 at 12:46
I only call `CCubeApplication::Draw()` once per frame, and that draws all the particles in one batch. I have to switch the VAO twice per frame, once for drawing the particles and once again for drawing the debug text. 2 draw calls total per frame. I put my `CCubeApplication::Update()` call in a for loop that keeps updating on a timestep that targets 60hz. That means that at best, the `CCubeApplication::Draw()` call can only happen once every 1/60 seconds (and therefore, one frame per 1/60 seconds). That's how I implement my VSync, and how my DirectX application is capped at 60fps. — riandrake, Jun 08 '14 at 12:54
Querying the uniform location by name every time you draw is also pretty inefficient (but should not account for 9+ ms of additional frame time). Since the location is fixed until you re-link your program, you should just query and store the integer value when you load your program and avoid having to search for it using a string repeatedly. — Andon M. Coleman, Jun 08 '14 at 13:02
Oh I do store the integer values, I just save those values into a map where I use the name of the uniform as a key. Mainly for readability reasons. — riandrake, Jun 08 '14 at 13:04
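
The name-keyed caching described in the last two comments could look roughly like the sketch below. The class and its `Lookup` callback are hypothetical; in a real program the callback would wrap `glGetUniformLocation(program, name)`, which is injected here so the sketch compiles without an OpenGL context:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache mapping a uniform name to its integer location.
class UniformLocationCache
{
public:
    // Stand-in for a driver query such as glGetUniformLocation.
    using Lookup = std::function<int(const std::string&)>;

    explicit UniformLocationCache(Lookup lookup) : m_lookup(std::move(lookup))
    {
        assert(m_lookup); // an empty std::function cannot resolve anything
    }

    // Returns the cached location, running the lookup only on first use.
    int Get(const std::string& name)
    {
        auto it = m_locations.find(name);
        if (it != m_locations.end())
            return it->second;

        const int location = m_lookup(name);
        m_locations.emplace(name, location);
        return location;
    }

    // Call after re-linking the program, since locations may change.
    void Clear() { m_locations.clear(); }

private:
    Lookup m_lookup;
    std::unordered_map<std::string, int> m_locations;
};
```

A cache like this still hashes the name string on every `Get`. Storing the returned integer in a member variable at load time avoids even that, at some cost to readability.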
