OpenGL Optimization - Duplicate Vertex Stream or Call glDrawElements Repeatedly?

Question

This is for an OpenGL ES 2.0 game on Android, though I suspect the right answer is generic to any OpenGL situation.

TL;DR - is it better to send N data to the GPU once and then make K draw calls with it, or to send K*N data to the GPU once and make 1 draw call?

More details: I'm wondering about best practices for my situation. I have a dynamic mesh whose vertices I recompute every frame (think of it as a water surface), and I need to project these vertices onto K different quads in my game. (In each case the projection is slightly different; sparing details, you could imagine them as K different mirrors surrounding the mesh.) K is on the order of 10-25; I'm still figuring it out.

I can think of two broad options:

1. Bind the mesh as is, and call draw K different times, either changing a uniform for shaders or messing with the fixed function state to render to the correct quad in place (on the screen) or to different segments of a texture (which I can later use when rendering the quads to achieve the same effect).
2. Duplicate all the vertices in the mesh K times, essentially making a single vertex stream with K meshes in it, and add an attribute (or a few) indicating which quad each mesh clone is supposed to project onto (and how to get there), and use vertex shaders to project. I would make one call to draw, but send K times as much data.

The question: of those two options, which is generally better performance-wise?

(Additionally: is there a better way to do this?

I had considered a third option, where I rendered the mesh details to a texture, and created my K-clone geometry as a sort of dummy stream, which I could bind once and for all, that looked up in a vertex shader into the texture for each vertex to find out what vertex it really represented; but I've been told that texture support in vertex shaders is poor or prohibited in OpenGL ES 2.0 and would prefer to avoid that route.)

Andon M. Coleman · Answer 1 · 2013-08-07T21:37:44.483

There is no perfect answer to this question, though I would suggest you think about the nature of real-time computer graphics and the OpenGL pipeline. Although "the GL" is required to produce results that are consistent with in-order execution, the reality is that GPUs are highly parallel beasts. They employ lots of tricks that work best if you actually have many unrelated tasks going on at the same time (some even split the whole pipeline up into discrete tiles). GDDR memory, for instance, has really high latency, so for efficiency GPUs need to be able to schedule other jobs to keep the stream processors (shader units) busy while memory is fetched for a job that is just starting.

If you are recomputing parts of your mesh each frame, then you will almost certainly want to favor more draw calls over massive CPU->GPU data transfers every frame. Saturating the bus with unnecessary data transfers plagues even PCI Express hardware (the transfers cost far more than the overhead that several additional draw calls would ever add); it can only get worse on embedded OpenGL ES systems. Having said that, there is no reason you could not simply call glBufferSubData (...) to stream in only the affected portions of your mesh and continue to draw the entire mesh in a single draw call.
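As a rough illustration of streaming only the affected portion, the sketch below computes the smallest contiguous byte range that covers every vertex changed this frame. The `DirtyRange` type, the `dirty_range` helper, and the per-vertex dirty flags are hypothetical names introduced for this example; the only real API involved is glBufferSubData, shown in a comment because it needs a live GL context. If the whole mesh changes every frame, as with the water surface in the question, the range simply covers the entire buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Byte range covering every vertex touched this frame (hypothetical type). */
typedef struct {
    size_t offset; /* byte offset of the first dirty vertex */
    size_t size;   /* bytes to upload; 0 means nothing changed */
} DirtyRange;

/* Hypothetical helper: smallest contiguous byte range covering all dirty
 * vertices. dirty[i] != 0 marks vertex i as modified; stride is the size
 * of one vertex in bytes. */
static DirtyRange dirty_range(const unsigned char *dirty,
                              size_t vertex_count, size_t stride)
{
    DirtyRange r = {0, 0};
    size_t first = vertex_count; /* sentinel: no dirty vertex seen yet */
    size_t last = 0;

    assert(stride > 0);
    for (size_t i = 0; i < vertex_count; ++i) {
        if (dirty[i]) {
            if (first == vertex_count)
                first = i;
            last = i;
        }
    }
    if (first == vertex_count)
        return r; /* nothing to upload */

    r.offset = first * stride;
    r.size = (last - first + 1) * stride;
    return r;
}

/* With the vertex buffer bound to GL_ARRAY_BUFFER, the upload would be:
 *
 *   glBufferSubData(GL_ARRAY_BUFFER, (GLintptr)r.offset, (GLsizeiptr)r.size,
 *                   (const char *)vertices + r.offset);
 */
```

A single contiguous range is the simplest policy; whether several smaller uploads would beat one larger one is something only profiling on the target device can tell.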

You might get better cache coherency if you split (or partition the data within) the buffer and/or draw calls up, depending on your actual use case. The only way to decisively tell which is going to work better in your case is to profile your software on your target hardware. But all of this fails to look at the bigger picture, which is: "Why am I doing this on the CPU?!"

It sounds like what you really want is simply vertex instancing. If you can re-work your algorithm to work completely in vertex shaders by passing instance IDs you should see a massive improvement over all of the solutions I have seen you propose so far (true instancing is actually somewhere between what you described in solutions 1 and 2) :)

The actual concept of instancing is very simple and will give you benefits whether your particular version of the OpenGL API supports it at the API level or not (you can always implement it manually with vertex attributes and extra vertex buffer data). The thing is, you would not have to duplicate your data at all if you implement instancing correctly. The extra data necessary to identify each individual vertex is static, and you can always change a shader uniform and make an additional draw call (this is probably what you will have to do with OpenGL ES 2.0, since it does not offer glDrawElementsInstanced) without touching any vertex data.
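A minimal sketch of the "change a uniform and draw again" loop might look like the following. It assumes each quad's projection can be expressed as one 4x4 matrix uniform, which the question does not confirm; `draw_mesh_per_quad` and its parameters are hypothetical. The calls glUniformMatrix4fv and glDrawElements are real OpenGL ES 2.0 entry points, but so the sketch compiles without a GL context they are replaced here by stand-ins that only count calls unless `USE_REAL_GLES2` (a made-up macro) is defined.

```c
#include <assert.h>

#ifdef USE_REAL_GLES2
#include <GLES2/gl2.h>
#else
/* Stand-ins with the GLES2 signatures; they only count calls. */
typedef int GLint;
typedef int GLsizei;
typedef unsigned int GLenum;
typedef unsigned char GLboolean;
typedef float GLfloat;
#define GL_TRIANGLES 0x0004
#define GL_UNSIGNED_SHORT 0x1403
#define GL_FALSE 0
static int uniform_uploads = 0;
static int draw_calls = 0;

static void glUniformMatrix4fv(GLint location, GLsizei count,
                               GLboolean transpose, const GLfloat *value)
{
    (void)location; (void)count; (void)transpose; (void)value;
    ++uniform_uploads;
}

static void glDrawElements(GLenum mode, GLsizei count, GLenum type,
                           const void *indices)
{
    (void)mode; (void)count; (void)type; (void)indices;
    ++draw_calls;
}
#endif

/* Draw the same mesh once per quad. The program, vertex buffer, and index
 * buffer are assumed to be bound already and are never touched in the loop.
 * projections holds quad_count matrices, 16 floats each (an assumption). */
static void draw_mesh_per_quad(GLint projection_loc,
                               const GLfloat *projections,
                               int quad_count, GLsizei index_count)
{
    assert(quad_count >= 0);
    for (int k = 0; k < quad_count; ++k) {
        /* Only shader state changes between successive draws. */
        glUniformMatrix4fv(projection_loc, 1, GL_FALSE, projections + 16 * k);
        /* Offset 0 into the bound GL_ELEMENT_ARRAY_BUFFER. */
        glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT,
                       (const void *)0);
    }
}
```

The mesh itself is uploaded once per frame; the per-quad cost is one small uniform update plus one draw call.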

You certainly will not have to duplicate your vertices K*N times; your buffer space complexity would be more like O(K + K*M), where M is the number of new components you had to add to uniquely identify each vertex so that you could calculate everything on the GPU. For "instance," you might need to number each of the vertices in your quad 1-4 and process the vertex differently in your shader depending on which vertex you're processing. In this case, the M coefficient is 1 and it does not change no matter how many instances of your quad you need to dynamically calculate each frame; N would determine the number of draw calls in OpenGL ES 2.0, not the size of your data. None of this additional storage space would be necessary in OpenGL ES 2.0 if it supported gl_VertexID :(
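To make the "number each vertex and branch on it" idea concrete, here is a CPU-side mirror of what a vertex shader could compute from a static corner-id attribute plus per-draw uniforms. Everything in it is illustrative: `Vec2`, `corner_position`, and the center/half-size parameterization are assumptions, and the corners are numbered 0-3 rather than 1-4 to keep the arithmetic simple. In a GLSL ES 1.00 shader the id would arrive as a float attribute and the same selection would be written with floor and mod, since that shading language offers neither integer attributes nor the `%` operator.

```c
#include <assert.h>

typedef struct { float x, y; } Vec2; /* hypothetical 2D vector type */

/* CPU mirror of a per-vertex computation driven by a static corner id.
 * corner_id: 0..3, stored once in a static attribute stream.
 * center, half_size: stand-ins for uniforms changed between draw calls. */
static Vec2 corner_position(int corner_id, Vec2 center, Vec2 half_size)
{
    assert(corner_id >= 0 && corner_id < 4);

    /* id mod 2 picks left/right, id div 2 picks bottom/top. */
    float sx = (corner_id % 2) ? 1.0f : -1.0f;
    float sy = (corner_id / 2) ? 1.0f : -1.0f;

    Vec2 p = { center.x + sx * half_size.x, center.y + sy * half_size.y };
    return p;
}
```

The id stream is written once and never re-uploaded; only the uniform stand-ins change from one draw to the next.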

Instancing is the best way to make effective use of the highly parallel GPU and avoid CPU/GPU synchronization and slow bus transfers. OpenGL ES 2.0 does not support instancing in the API sense, but multiple draw calls using the same vertex buffer, where the only thing you change between calls is a couple of shader uniforms, are often preferable to the alternatives: computing your vertices on the CPU and uploading new vertex data every frame, or having your vertex buffer's size depend directly on the number of instances you intend to draw (yuck). You'll have to try it out and see what your hardware likes.
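For a sense of scale, the per-frame upload sizes of the two options from the question can be compared with simple arithmetic. The helper names and the example figures in the note below (vertex count, stride, tag size) are hypothetical; only the structure of the comparison comes from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Option 1: upload N vertices once per frame, then issue K draw calls. */
static size_t bytes_shared_mesh(size_t n_vertices, size_t stride)
{
    return n_vertices * stride;
}

/* Option 2: upload K copies of the mesh, each vertex carrying tag_bytes
 * of extra attribute data that says which quad it projects onto. */
static size_t bytes_duplicated_mesh(size_t n_vertices, size_t stride,
                                    size_t k_quads, size_t tag_bytes)
{
    assert(k_quads > 0);
    return k_quads * n_vertices * (stride + tag_bytes);
}
```

With 1024 vertices of 12 bytes each, 16 quads, and a 4-byte tag, option 1 uploads 12288 bytes per frame while option 2 uploads 262144 bytes, roughly 21 times as much. Whether the 15 extra draw calls cost less than that extra traffic is still a question for the profiler.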

score 0 · Answer 2 · answered Aug 07 '13 at 21:27

Instancing would be what you are looking for, but unfortunately it is not available with OpenGL ES 2.0. I would be in favor of sending all the vertices to the GPU and making one draw call, if all your assets can fit into GPU memory. In my experience, reducing draw calls from 100+ to 1 took performance from 15 fps to 60 fps.

radarhead

For anybody else who thinks like me: for a moment I was excited that I might use gl_VertexID and duplication in the index buffer only (which could be bound once and only once) to derive a sort of hacked instance id in my shader, but http://stackoverflow.com/questions/10044185/opengles-2-0-gl-vertexid-equivalent clarifies that gl_VertexID is only available in OpenGL ES 3.0+ :( – jdowdell Aug 07 '13 at 22:03
Right, I mentioned that in my answer :) It is really unfortunate, but the good news is that the expense of a draw call is not always what you would think. In Direct3D minimizing draw calls was important because each time you made a draw call, it had to switch from user-mode to kernel-mode, which invoked a lengthy context switch. Naturally, D3D adopted instancing long before OpenGL :) In the OpenGL world, part of the expense of draw calls is actually deferred state setup - all the states and commands you queued between the last call and the current call form a large portion of the expense. – Andon M. Coleman Aug 07 '13 at 22:06
Two successive draw calls, using the same vertex and index buffers, where the only state(s) you change are shader states can actually be quite cheap. So you'll really have to give things a try before you stereotype a draw call as your performance bottleneck. – Andon M. Coleman Aug 07 '13 at 22:07
