Shader optimization: Is a ternary operator equivalent to branching?

Question

I'm working on a vertex shader in which I want to conditionally drop some vertices:

float visible = texture(VisibleTexture, index).x;
if (visible > threshold)
    gl_Vertex.z = 9999; // send out of frustum

I know that branches kill performance when there's little commonality between neighboring data. In this case, every other vertex may get a different 'visible' value, which would be bad for the performance of the local shader core cluster (from my understanding).

To my question: Is a ternary operator better (irrespective of readability issues)?

float visible = texture(VisibleTexture, index).x;
gl_Vertex.z = (visible > threshold) ? 9999 : gl_Vertex.z;

If not, is converting it into a calculation worthwhile?

float visible = texture(VisibleTexture, index).x;
visible = sign(visible - threshold) * .5 + .5; // 1 = above threshold, 0 = below (0.5 if equal)
gl_Vertex.z += 9999 * visible; // z keeps its original value only when visible is below the threshold

Is there an even better way to drop vertices without relying on a Geometry shader?

Thanks in advance for any help!

soulsource · Answer 1 · 2017-01-29T18:10:26.660

Actually this depends on the shader language one uses.

In HLSL and Cg a ternary operator will never lead to branching. Instead, both possible results are always evaluated and the unused one is discarded. To quote the HLSL documentation:

Unlike short-circuit evaluation of &&, ||, and ?: in C, HLSL expressions never short-circuit an evaluation because they are vector operations. All sides of the expression are always evaluated.

For Cg the situation is similar: there, too, the ternary conditional operator is a vector operator (documentation):

Unlike C, side effects in the expressions in the second and third operands are always executed, regardless of the condition.

In ESSL and GLSL the ternary operator has branching semantics. It is not a vector operator, so the condition has to evaluate to a scalar boolean, and only the selected operand is evaluated. See the GLSL specification:

It operates on three expressions (exp1 ? exp2 : exp3). This operator evaluates the first expression, which must result in a scalar Boolean. If the result is true, it selects to evaluate the second expression, otherwise it selects to evaluate the third expression. Only one of the second and third expressions is evaluated.

(Source for ESSL)

An illustration of the difference is for instance available on the Khronos WebGL test site for ternary operators.
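
To make the difference concrete, here is a minimal sketch in plain C (not shader code). C's own ?: short-circuits, which matches the GLSL/ESSL wording quoted above, while passing both operands through a function call forces both to be evaluated, roughly like the HLSL/Cg behaviour. The names then_value, else_value, select_both and the two pick_ functions are made up for this illustration.

```c
/* Counters record how often each operand expression is evaluated. */
static int then_calls = 0;
static int else_calls = 0;

static int then_value(void) { then_calls++; return 1; }
static int else_value(void) { else_calls++; return 2; }

/* Hypothetical helper: by the time it runs, both arguments have
   already been evaluated by the caller, whatever cond is. */
static int select_both(int cond, int if_true, int if_false)
{
    return cond ? if_true : if_false;
}

/* C's ?: evaluates only the chosen operand (GLSL/ESSL-like wording). */
static int pick_short_circuit(int cond)
{
    return cond ? then_value() : else_value();
}

/* Both operands are evaluated as call arguments, then one is kept
   (an approximation of the HLSL/Cg behaviour). */
static int pick_evaluate_both(int cond)
{
    return select_both(cond, then_value(), else_value());
}
```

Calling pick_short_circuit(1) bumps only then_calls, whereas pick_evaluate_both(1) bumps both counters and still returns the same value.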

Olhovsky · Accepted Answer · 2011-02-06T04:14:44.717

A ternary operator is just syntactic sugar for an if statement. They are the same.

If you had more to write inside of your if statement, there might be some optimization that could be done here, but with so little inside of either branch, there is nothing to optimize really.

Often branching is not used by default.

In your case, the ternary operator (or if statement) is probably evaluating both sides of the condition first and then discarding the branch that was not satisfied by the condition.

In order to use branching, you need to set the branching compiler flag in your shader code, to generate assembly that instructs the GPU to actually attempt to branch (if the GPU supports branching). In that case, the GPU will try to branch only if the branch predictor says that some predefined number of cores will take one of the branches.

Your mileage may vary from one compiler and GPU to another.
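
As a rough illustration of "evaluate both sides, then discard one", here is a sketch in plain C rather than GLSL. It assumes the intent of the question's first snippet (replace z with 9999 when visible > threshold). hide_if_above and hide_if_above_branch are hypothetical names, and whether a real shader compiler emits anything like the first form is exactly the part that varies by compiler and GPU.

```c
/* Branch-free form: both candidate values exist up front and a 0/1
   mask blends between them. Plain C stand-in, not actual shader code. */
static float hide_if_above(float z, float visible, float threshold)
{
    const float far_z = 9999.0f;                /* value used in the question */
    float mask = (float)(visible > threshold);  /* 1.0f if above, else 0.0f */
    /* Caveat: unlike a true select, this yields NaN if z is not finite. */
    return mask * far_z + (1.0f - mask) * z;
}

/* Reference form with an explicit branch, for comparison. */
static float hide_if_above_branch(float z, float visible, float threshold)
{
    if (visible > threshold)
        return 9999.0f;
    return z;
}
```

For finite z the two functions return the same value; only a benchmark on the target GPU can say which form is cheaper there.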

edited Feb 06 '11 at 04:14

answered Feb 06 '11 at 04:09

Olhovsky

A ternary operator is not just syntactic sugar. At least on x86, it's a pipelining optimization that helps with branch prediction (which would help OP here). I know this is GPUs and not CPUs, but I felt like it would be worth mentioning. – GraphicsMuncher Apr 29 '13 at 00:10
Compilers are smart though. – Olhovsky Apr 30 '13 at 16:02
I'm confused. I just found this answer, then I saw this: https://www.reddit.com/r/GraphicsProgramming/comments/2w72vr/askglsl_which_is_faster_ternary_conditional_or/coogrfz . Is that to say that guy is mistaken? – Aiman Al-Eryani Apr 30 '16 at 09:29
I am not really sure whether the OP's code represents GLSL or HLSL, but I assume GLSL if all user defined functions are present (so I added the GLSL tag for clarity). In case of HLSL, however, this answer is wrong. The condition of an if statement needs to evaluate to a scalar, whereas the ? : construct may be used for both scalars **and** vectors. – Matthias Oct 26 '17 at 09:37
@GraphicsMuncher x86 doesn't even apply to all CPUs, let alone GPU instruction sets. GPUs are not going to have ternary logic that takes up valuable compute real estate per thread or group of threads; they have concepts for avoiding branching that go well beyond that (selection mechanisms inside thread managers, i.e. SMs). You normally cannot access this functionality explicitly, and it is *very* vendor dependent. You cannot make any statements about what a GPU will or won't do without benchmarks and looking at intermediate representations (SPIR-V and CUDA PTX) or reverse engineering the GPUs themselves – Krupip Apr 06 '18 at 14:56

score 10 · Answer 3 · answered Sep 13 '11 at 09:06

This mathematical solution can be used to replace conditional statements. OpenCL provides the same operation as bitselect(falsereturnvalue, truereturnvalue, condition), where condition is the bit mask and comes last:

int a = in0[i], b = in1[i];
int cmp = -(a < b); // scalar a < b yields 1 or 0; negating gives all bits 1 if TRUE, all bits 0 if FALSE
// & bitwise AND
// | bitwise OR
// ~ flips all bits
out[i] = (a&cmp) | (b&~cmp); //a when TRUE and b when FALSE

I am, however, unsure how to apply this in your situation, since I'm not sure I fully understood your code. Still, I hope this answer helps you or others.
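
For anyone who wants to try the mask trick outside OpenCL, here is a small sketch in standard C. bitselect_u32 is a hypothetical helper that mimics the argument order of OpenCL's bitselect (mask last), and min_branchless applies it to the a < b example above. Treat it as an illustration of the idea, not as something known to be faster on any particular GPU.

```c
#include <stdint.h>

/* Mimics OpenCL bitselect(a, b, mask): each result bit is taken
   from b where the mask bit is 1, otherwise from a. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Branch-free minimum built on the mask trick from the answer. */
static int32_t min_branchless(int32_t a, int32_t b)
{
    /* (a < b) is 0 or 1 in C; 0u - 1u wraps to 0xFFFFFFFF (all bits set). */
    uint32_t mask = 0u - (uint32_t)(a < b);
    /* Picks a when the mask is all ones, b when it is zero. The cast back
       to int32_t assumes the usual two's-complement conversion. */
    return (int32_t)bitselect_u32((uint32_t)b, (uint32_t)a, mask);
}
```

The explicit 0u - ... step matters: a plain scalar comparison in C only sets the lowest bit, so using it directly as a mask would select a single bit instead of the whole value.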

answered Sep 13 '11 at 09:06

Mnescat

Very nice and elegant solution! – Anton Shkurenko Oct 22 '15 at 20:14
This is not to my credit. I fear I did not add the source but now I no longer know where it is. So kudos to the original creator of the idea. – Mnescat Nov 17 '16 at 15:53
Note that bitwise operations are not actually fast on GPUs. At least on Nvidia GPUs you'll get the same throughput as integer addition, which can be slower than floating-point addition. You'd have to benchmark to make sure, since latency hiding may obscure this fact and make it irrelevant. Regardless, you are doing two texture loads, which are cached initially but still limit performance. Then you do a compare, and, or, and, inv, and finally a store, so you are talking about two large operations and 6 nominal operations. This is probably not an optimization on conventional GPUs. – Krupip Apr 06 '18 at 14:50
@Mnescat https://www.sharcnet.ca/events/ss2010/courses/opencl/ss2010_opencl.pdf might be where you found it. – AMDG Aug 27 '18 at 01:04

score 2 · Answer 4 · edited Aug 19 '17 at 15:14

The answer depends on three things:

the compiler and the kinds of optimisations it performs
the architecture and language
the exact situation in which you are using the ternary operator.

Consider this example:

int a = condition ? 100 : 0;

In this case, a typical compiler on a typical architecture might be able to eliminate a branch assuming booleans are represented as integers. The code could be translated to

int a = condition * 100;

The same kind of optimisation might be possible with an equivalent if condition:

int a = 0;

if (condition) {
    a = 100;
}

It all depends on particular optimisations performed by the compiler.

Generally speaking, my advice is: If you can use a ternary operator, it is preferable to use it. It is more likely to be optimised by the compiler. It also results in a more declarative style of code.
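
To check that the forms above really agree, here is a small self-contained C sketch. The function names with_ternary, with_if and with_multiply are invented for this example. One assumption is worth spelling out: the multiplication is only equivalent when the condition is exactly 0 or 1, so the sketch normalises it with != 0 first.

```c
/* condition may be any nonzero value for "true", as usual in C. */
static int with_ternary(int condition)
{
    return condition ? 100 : 0;
}

static int with_if(int condition)
{
    int a = 0;
    if (condition) {
        a = 100;
    }
    return a;
}

/* (condition != 0) is exactly 0 or 1, so the product is 0 or 100.
   Writing condition * 100 directly would give 200 for condition == 2. */
static int with_multiply(int condition)
{
    return (condition != 0) * 100;
}
```

All three return 100 for any nonzero condition and 0 otherwise; whether a given compiler turns them into the same instructions is something only its generated code can tell you.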

edited Aug 19 '17 at 15:14

phuclv

answered Dec 02 '13 at 11:56

HRJ

Regarding your assignment of a ternary result being translated to a multiplication: I understand the point that optimizations may be made, but I think that's a particularly bad example. A multiplication would be much slower than any combination of assignment and branches on (most likely) any processor. I hope no compiler would actually generate something equivalent. – user1167662 Jan 09 '16 at 01:17
@user1167662 It depends on the micro-architecture, of course. In a processor optimised for graphics, a multiplication wouldn't take more than a couple of cycles. A branch is much more likely to be costlier, depending on how deep the pipeline is. Some typical numbers are 3 cycles for a multiplication vs. about 14 stages for the pipeline – HRJ Jan 11 '16 at 15:49
