Gaining an understanding of performance implications of shader stages, particularly the GS

Question

I am confused about what's faster versus what's slower when it comes to coding algorithms that execute in the pipeline.

I wrote a program with a geometry shader (GS) that seemed to be fill-rate bound: timer queries showed it executing much faster with rasterisation disabled.

So I wrote a different, multi-pass algorithm using transform feedback. Every pass still uses a GS, but in theory it does much less work overall by executing in stages, and it significantly reduces the fill-rate cost because it renders far fewer triangles. In my early tests, however, it appears to run slower.

My first thought was that I had traded the fill-rate bottleneck for the overhead of issuing multiple draw calls. But how expensive is another draw call, really? How much overhead does it add on the CPU and on the GPU?

Then I read an answer to a different Stack Exchange question about the GS:

No one has ever accused Geometry Shaders of being fast. Especially when increasing the size of geometry.

Your GS is taking a line and not only doing a 30x amplification of vertex data, but also doing lighting computations on each of those new vertices. That's not going to be terribly fast, in large part due to a lack of parallelism. Each GS invocation has to do 60 lighting computations, rather than having 60 separate vertex shader invocations doing 60 lighting computations in parallel.

You're basically creating a giant bottleneck in your geometry shader.

It would probably be faster to put the lighting stuff in the fragment shader (yes, really).

That makes me wonder how a geometry shader can be slower when using one means less work overall. I know things execute in parallel, but my understanding is that there is only a relatively small group of shader cores, so launching far more threads than there are cores (by "thread" I mean one invocation of a shader) should make the cost roughly proportional to program complexity (instruction count) times the number of threads. If an instruction can execute once per vertex in the geometry shader instead of once per fragment, why would it ever be slower?
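
The mental model above can be written down as a toy cost model. This is only a sketch under loud assumptions: the lane count, the unit lighting cost, and the function names are all hypothetical, and the model ignores everything except "waves of invocations times cost per invocation" (no memory traffic, no GS output buffering, no primitive ordering). The only number taken from the quoted answer is the 60 lighting computations per GS invocation.

```python
import math

# Hypothetical numbers, chosen only for illustration.
LANES = 1024          # assumed number of shader invocations that can run at once
LIGHTING_COST = 1     # cost of one lighting computation, in arbitrary units
PER_GS_LIGHTING = 60  # from the quoted answer: 60 lighting computations per GS invocation


def estimated_time(invocations: int, cost_per_invocation: int, lanes: int = LANES) -> int:
    """Toy estimate: number of sequential waves times the cost of one invocation."""
    waves = math.ceil(invocations / lanes)
    return waves * cost_per_invocation


def compare(input_lines: int) -> tuple[int, int]:
    """Return (time with lighting inside the GS, time with lighting spread over separate invocations)."""
    # One GS invocation per input line, each doing all 60 lighting computations serially.
    in_gs = estimated_time(input_lines, PER_GS_LIGHTING * LIGHTING_COST)
    # Same total work, but one lighting computation per invocation.
    spread = estimated_time(input_lines * PER_GS_LIGHTING, LIGHTING_COST)
    return in_gs, spread


if __name__ == "__main__":
    for lines in (1, 100, 100_000):
        print(lines, compare(lines))
```

In this model the two approaches cost nearly the same once the number of input primitives is far larger than the number of lanes, which matches the intuition in the question. With few input primitives the GS version is much slower because most lanes sit idle while each invocation does its 60 computations serially, which is the situation the quoted answer describes. Anything beyond that (for example the ordering and variable-output issues raised in the comments) is outside what this sketch captures.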

Help me gain a better understanding so I don't waste time designing algorithms that are inefficient.

For one thing, vertex shaders and fragment shaders emit exactly one vertex or fragment. Geometry shaders are capable of emitting a variable number of primitives and vertices at run-time. That means that running them in parallel is more complicated, especially since the output primitives need to be shaded in a certain order to produce a consistent final image. — Andon M. Coleman, May 28 '14 at 16:38
I think the GS executes once per primitive, not once per vertex (unless you render GL_POINTS). It re-creates the output primitives. I believe that whatever the GS does, adding one will always slow down your overall pass. Even if you end up discarding primitives, the code that checks whether a primitive should be discarded (not re-emitted) takes time, and you could save that time by discarding fragments instead, with no GS at all. But the GS is useful, don't get me wrong. For feedback buffers, for example, if you don't have a CS at hand, it is a helpful feature. — agrum, May 28 '14 at 19:02
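
The "once per primitive, not once per vertex" point in the comment above can be made concrete with a small count. This is a sketch that assumes a single draw call with no primitive restart and no GS instancing; the function name `gs_invocations` is made up for illustration, and only a few primitive modes are covered.

```python
def gs_invocations(mode: str, vertex_count: int) -> int:
    """Number of GS invocations for one draw call: one per assembled input primitive."""
    if mode == "GL_POINTS":
        return vertex_count                 # every vertex is its own primitive
    if mode == "GL_LINES":
        return vertex_count // 2            # independent pairs of vertices
    if mode == "GL_LINE_STRIP":
        return max(vertex_count - 1, 0)     # each new vertex adds one line
    if mode == "GL_TRIANGLES":
        return vertex_count // 3            # independent triples of vertices
    if mode == "GL_TRIANGLE_STRIP":
        return max(vertex_count - 2, 0)     # each new vertex adds one triangle
    raise ValueError(f"mode not covered by this sketch: {mode}")


if __name__ == "__main__":
    for mode in ("GL_POINTS", "GL_LINES", "GL_TRIANGLES", "GL_TRIANGLE_STRIP"):
        print(mode, gs_invocations(mode, 600))
```

Only with GL_POINTS does the invocation count equal the vertex count, so "once per vertex in the GS" holds only for point input.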
