How to reuse vertices across primitives in OpenGL

Question

I am using OpenGL in C++ (technically EGL, on a Jetson Nano).

Let's say I want to draw N Quads. Imagine just a list of colored rectangles. There may be a few thousand such rectangles in the frame.

I want to use two vertex buffers:

One that defines the geometry of each quad.
One that defines the per-quad properties, shared by all four vertices of a quad.

The first vertex buffer should define the geometry of each quad. It should have only 4 vertices in it and its data would be just the corners of a quad. Something like:

0, 0, // top left
1, 0, // top right
0, 1, // bottom left
1, 1, // bottom right

Then the second vertex buffer should have just the x, y, width, height, and color of each rectangle.

x1, y1, width1, height1, color1,
x2, y2, width2, height2, color2,
x3, y3, width3, height3, color3,
x4, y4, width4, height4, color4,
x5, y5, width5, height5, color5,
x6, y6, width6, height6, color6,
... etc.

The thing is that each one of the items in my rectangle buffer should apply to 4 vertices in the vertex buffer.

Is there a way to set this up so that it keeps reusing the same 4 quad vertices over and over for each rectangle and applies the same rectangle properties to 4 vertices at a time?

I'm imagining there's something I can do so that I say that the first vertex buffer should use one element per vertex and wraps around, but the second vertex buffer uses one element per every four vertices or something like that.

How do I set this up?

What I do now:

Right now I need one vertex buffer that just has the quad vertices repeated over and over as many times as I have instances.

0, 0, // (1) top left
1, 0, // 
0, 1, // 
1, 1, // 
0, 0, // (2) top left
1, 0, // 
0, 1, // 
1, 1, // 
0, 0, // (3) top left
1, 0, // 
0, 1, // 
1, 1, // 
... etc

And my second buffer duplicates its data for each vertex:

x1, y1, width1, height1, color1,
x1, y1, width1, height1, color1,
x1, y1, width1, height1, color1,
x1, y1, width1, height1, color1,
x2, y2, width2, height2, color2,
x2, y2, width2, height2, color2,
x2, y2, width2, height2, color2,
x2, y2, width2, height2, color2,
x3, y3, width3, height3, color3,
x3, y3, width3, height3, color3,
x3, y3, width3, height3, color3,
x3, y3, width3, height3, color3,
x4, y4, width4, height4, color4,
x4, y4, width4, height4, color4,
x4, y4, width4, height4, color4,
x4, y4, width4, height4, color4,
x5, y5, width5, height5, color5,
x5, y5, width5, height5, color5,
x5, y5, width5, height5, color5,
x5, y5, width5, height5, color5,
x6, y6, width6, height6, color6,
x6, y6, width6, height6, color6,
x6, y6, width6, height6, color6,
x6, y6, width6, height6, color6,
... etc.

This seems really inefficient and I just want to specify the first 4 vertices once and have it keep reusing them somehow rather than duplicating these 4 vertices N times to have a total of 4*N vertices in my first buffer. And I only want to specify the x,y,width,height,color attributes once for each quad for a total of N vertices, and not once for each overall vertex for a total of 4*N vertices.

What do I do?

What problem is being solved relative to doing things the normal way? — Nicol Bolas, Oct 16 '19 at 18:43
@NicolBolas What's the normal way? The problem I'm solving is updating only N vertices per frame rather than 4*N vertices per frame. — Wyck, Oct 16 '19 at 18:52
Yes, the normal way is updating 4*N vertices per frame. And I don't mean your "x,y, w, h" stuff; I mean standard "position + color" for each vertex. My question was why you feel that this is wrong; what problem are you trying to solve by doing it this way rather than the expected way? — Nicol Bolas, Oct 16 '19 at 19:26
@NicolBolas If I wanted to do dodecahedra instead of rectangles, would it be more relevant? ...if there were 100+ vertices in the shape that I were drawing in place of the quad? Would this technique make sense then? It is just because a quad is so simple that you're having trouble accepting my motivation for doing this? It seems obvious to me that it's better to upload only N vertices per frame rather than N*M vertices per frame. I just need to know the technique: how to tell OpenGL how the buffers are set up. — Wyck, Oct 16 '19 at 20:02
@httpdigest that sounds useful. Do instances actually come from a second buffer? I was under the impression that instance data gets passed in as an array via a uniform, essentially. — Wyck, Oct 16 '19 at 20:05
@Wyck: "*If I wanted to do dodecahedra instead of rectangles, would it be more relevant? ...if there were 100+ vertices in the shape that I were drawing in place of the quad?*" Yes, those would be more relevant because you'd be looking at a performance issue: how to draw a (presumably) large number of mid-sized meshes. But even then, it would be contingent on *being a performance issue*. That is, the answer to the question I'm asking would be "because I've profiled the normal code and found it to be slow". — Nicol Bolas, Oct 16 '19 at 20:05
@Wyck: Because if you don't know that it's an actual performance problem, then the solution you employ may well make your code *slower*. — Nicol Bolas, Oct 16 '19 at 20:06
@NicolBolas My problem is that I want to compare technique 1 and technique 2 and I don't know how to create an implementation of technique 2 to even test against my technique 1 implementation. — Wyck, Oct 16 '19 at 20:09
@Wyck: The problem is that your "technique 1", if it's what you described, is the *wrong technique*. Comparing your hypothetical "technique 2" to it would be wrong, because you're not starting from the best version of your current mechanism. That is, technique 2 would only be faster because technique 1 is bad, not because technique 2 is good. — Nicol Bolas, Oct 16 '19 at 20:11
@Wyck: Exactly what I said: "I don't mean your "x,y, w, h" stuff; I mean standard "position + color" for each vertex." Stop trying to send per-instance data as a separate channel; just provide the position and color for the 4 quad vertices. — Nicol Bolas, Oct 16 '19 at 20:16
Why would I ask the CPU to compute per instance data when the GPU can do all that math for me? — Wyck, Oct 16 '19 at 20:17
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/200976/discussion-between-nicol-bolas-and-wyck). — Nicol Bolas, Oct 16 '19 at 20:18

score 2 · Answer 1 · answered Oct 16 '19 at 21:22

Generally speaking, the most efficient way to render a series of quads is to... render a series of quads. You don't send width/height or other per-instance information; you compute the actual positions of the 4 vertices on the CPU and you write them to GPU memory using appropriate buffer object streaming techniques. Specifically, avoid trying to change only a few quads; if your data isn't static, it's probably going to be better to re-upload all of it (to a different/invalidated buffer) rather than modify only a few bytes in-situ.
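To make the "compute the positions on the CPU" approach concrete, here is a minimal C++ sketch. The `Rect` and `Vertex` structs and the `expandQuads` function are hypothetical names made up for illustration, and the packed RGBA8 color is an assumption about how the color is stored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side description of one rectangle (illustrative names).
struct Rect {
    float x, y, width, height;
    std::uint32_t color;  // assumed to be packed RGBA8
};

// One fully expanded vertex: final position plus color.
struct Vertex {
    float x, y;
    std::uint32_t color;
};

// Expands every rectangle into its 4 corner vertices, in the order
// top-left, top-right, bottom-left, bottom-right.
std::vector<Vertex> expandQuads(const std::vector<Rect>& rects) {
    std::vector<Vertex> out;
    out.reserve(rects.size() * 4);
    for (const Rect& r : rects) {
        out.push_back({r.x,           r.y,            r.color});
        out.push_back({r.x + r.width, r.y,            r.color});
        out.push_back({r.x,           r.y + r.height, r.color});
        out.push_back({r.x + r.width, r.y + r.height, r.color});
    }
    assert(out.size() == rects.size() * 4);
    return out;
}
```

The resulting array is what you would stream into the vertex buffer each frame, for example by re-uploading the whole thing with `glBufferData` rather than patching individual quads.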

Your hypothetical alternative would only perform better in two scenarios: if the bandwidth of writing data to the GPU is your current bottleneck (whether due to quads or some other transfers you're doing) or if the bandwidth of reading data for rendering is the current bottleneck.

You can mitigate this issue by reducing the size of the vertex data. Since we're talking about 2D quads, you could very well use shorts for the XY position of each vertex, or 16-bit floats. Either way, each vertex (position + color) takes up only 8 bytes, so a quad is just 32 bytes of data. Obviously 12 bytes is less than 32 (12 being the per-instance cost if you use similar compression), but 32 bytes is still a 33% reduction over the 48 bytes per quad that full float positions would use.
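The byte counts above can be checked with a small C++ sketch. The struct names are hypothetical, and the sizes assume a typical ABI where `float` and `std::uint32_t` are 4 bytes with 4-byte alignment.

```cpp
#include <cstddef>
#include <cstdint>

// Full-precision vertex: two 32-bit floats plus a packed RGBA8 color.
struct VertexF32 {
    float x, y;
    std::uint32_t color;
};

// Compact vertex: two 16-bit integers plus the same packed color.
struct VertexI16 {
    std::int16_t x, y;
    std::uint32_t color;
};

// Hypothetical compact per-instance record: position and size as
// 16-bit pairs, plus the packed color.
struct InstanceI16 {
    std::int16_t x, y;
    std::int16_t width, height;
    std::uint32_t color;
};

constexpr std::size_t kVerticesPerQuad = 4;
constexpr std::size_t kBytesPerQuadF32 = kVerticesPerQuad * sizeof(VertexF32);
constexpr std::size_t kBytesPerQuadI16 = kVerticesPerQuad * sizeof(VertexI16);

static_assert(sizeof(VertexF32) == 12, "float vertex should be 12 bytes");
static_assert(sizeof(VertexI16) == 8, "compact vertex should be 8 bytes");
static_assert(sizeof(InstanceI16) == 12, "compact instance should be 12 bytes");
```

With these layouts a quad costs 48 bytes with float positions, 32 bytes with 16-bit positions, and 12 bytes as a single per-instance record.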

If you have done your profiling homework and have determined that 32 bytes per quad is too much, vertex instancing is still a bad idea. It is well known that, on some hardware, extremely small instances can kill your VS performance, so instancing with only 4 vertices per instance should be avoided.

In this case, it may be best to forgo all vertex attribute usage (your VAO should have all arrays disabled and your VS should have no `in` values defined). Instead, you should fetch instance data directly from SSBOs.

The gl_VertexID input value tells you what vertex index is being rendered. Given that you're rendering quads, the current instance would be gl_VertexID / 4. And the current vertex within the quad is gl_VertexID % 4. So your VS would look something like this:

#version 430 core
//Assumed version; SSBOs and std430 need GLSL 4.30 or later.

struct instance
{
  vec2 position;
  vec2 size;
  uint color; //Packed as 4 bytes; unpack with unpackUnorm4x8
  uint padding; //Padding needed due to alignment/stride of 8 bytes.
};

layout(binding = 0, std430) buffer instance_data
{
  instance instances[];
};

//Unit-quad corners, indexed by the vertex's position within its quad.
const vec2 vertex_table[4] = vec2[4](
  vec2(0.0, 0.0),
  vec2(1.0, 0.0),
  vec2(0.0, 1.0),
  vec2(1.0, 1.0)
);

void main()
{
    //Four consecutive vertex indices share one instance record.
    instance curr_instance = instances[gl_VertexID / 4];
    vec2 vertex = vertex_table[gl_VertexID % 4];

    vertex = curr_instance.position + (curr_instance.size * vertex);
    gl_Position = vec4(vertex.xy, 0.0, 1.0);
}
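One way to feed such a shader, sketched here as an assumption rather than something the shader requires, is an index buffer with 6 indices per quad (two triangles) over 4 "virtual" vertices per quad. In indexed draws `gl_VertexID` is the index value itself, so the `/ 4` and `% 4` arithmetic still works. `buildQuadIndices`, `VertexRef` and `decodeVertexId` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds 6 indices per quad (two triangles) over 4 virtual vertices per quad.
std::vector<std::uint32_t> buildQuadIndices(std::uint32_t quadCount) {
    assert(quadCount <= UINT32_MAX / 4);  // base index must fit in 32 bits
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    // Corner numbering matches vertex_table: 0=TL, 1=TR, 2=BL, 3=BR.
    const std::uint32_t corners[6] = {0, 1, 2, 2, 1, 3};
    for (std::uint32_t q = 0; q < quadCount; ++q) {
        const std::uint32_t base = q * 4;
        for (std::uint32_t c : corners) {
            indices.push_back(base + c);
        }
    }
    return indices;
}

struct VertexRef {
    std::uint32_t instance;  // which rectangle
    std::uint32_t corner;    // which corner of the unit quad
};

// CPU-side mirror of the shader's gl_VertexID / 4 and gl_VertexID % 4.
VertexRef decodeVertexId(std::uint32_t vertexId) {
    return {vertexId / 4, vertexId % 4};
}
```

The index buffer depends only on the quad count, so it can be built once and reused across frames while only the instance data changes.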

How fast this sort of thing will be depends entirely on how well your GPU handles these kinds of global memory reads. Note that it is at least hypothetically possible to reduce the size of the per-instance data back to 12 bytes. You can pack the position into a single uint holding two 16-bit values, and do the same for the size. Those values can be either unsigned normalized shorts, unpacked with unpackUnorm2x16, or half-floats, unpacked with unpackHalf2x16. If you do this, then your instance struct is just 3 uint values, and there is no need for padding.
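On the CPU side, the unorm variant of that packing might look like the sketch below. The function names mirror the GLSL built-ins, but these are hand-written C++ helpers, not part of any library. `PackedInstance` and `packInstance` are hypothetical. The sketch assumes positions and sizes are already normalized to [0, 1], so the shader would still have to map them to clip space.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// CPU-side counterpart of GLSL packUnorm2x16: x goes in the low 16 bits,
// y in the high 16 bits, each clamped to [0, 1] and scaled to 0..65535.
std::uint32_t packUnorm2x16(float x, float y) {
    auto toU16 = [](float v) -> std::uint32_t {
        const float clamped = std::min(std::max(v, 0.0f), 1.0f);
        return static_cast<std::uint32_t>(std::lround(clamped * 65535.0f));
    };
    return toU16(x) | (toU16(y) << 16);
}

// Inverse operation, mirroring GLSL unpackUnorm2x16.
void unpackUnorm2x16(std::uint32_t packed, float& x, float& y) {
    x = static_cast<float>(packed & 0xFFFFu) / 65535.0f;
    y = static_cast<float>(packed >> 16) / 65535.0f;
}

// Three uints per instance: 12 bytes, no padding needed.
struct PackedInstance {
    std::uint32_t position;  // x, y as a unorm16 pair
    std::uint32_t size;      // width, height as a unorm16 pair
    std::uint32_t color;     // packed RGBA8
};
static_assert(sizeof(PackedInstance) == 12, "three uints, no padding");

PackedInstance packInstance(float x, float y, float w, float h,
                            std::uint32_t rgba) {
    assert(w >= 0.0f && h >= 0.0f);  // negative sizes would be clamped to 0
    return {packUnorm2x16(x, y), packUnorm2x16(w, h), rgba};
}
```

Sixteen bits give 65536 steps per axis, which is usually plenty for screen-space rectangles, but whether the precision is acceptable depends on your coordinate range.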
