Summary:
- Branchless might not even be the best choice.
- Inline asm defeats some other optimizations, try other source changes first, e.g.
? :
often compiles branchlessly, also use booleans as integer 0/1.
- If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
- The whole thing is doable with
cmp %[b], %[a]
/ adc %[k],%[k]
. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program
&& perf report
), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1
doesn't have to wait for a
and b
to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless probably is good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path,
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
Related: gcc optimization flag -O3 makes code slower then -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3
's branchless code (without profile guided optimization) bottlenecks on a data dependency from using cmov
. But -O3 -fprofile-use
makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b)
as others have suggested.
Inline asm defeats many optimizations, most obvious constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm.
You could maybe use if(__builtin_constant_p(a))
and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p()
is evaluated before function inlining.)
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a
in a register and b
in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp
.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline-asm destroys the ability of the compiler to reuse use any temporary results, or spread out the instructions to mix with other compiler-generated code. (Instruction-scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b
is an unsigned compare (you're using setb
, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1
is the same thing as k+k
.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from edi - esi
adc %rax, %rax # eax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add
or lea
for a left-shift by 1, and some are smart enough to use adc
instead of setb
, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple other versions. (I used unsigned
in this version, because you had addl
in your asm. Using unsigned long
makes everything except the xor-zeroing into 64-bit registers. (xor %eax,%eax
is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding. ((k<<1) + ( a < b );
gives us exactly the same xor
/cmp
/setb
/ lea
sequence as clang above (but with the Windows x86-64 calling convention).
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3@funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3@funcarg:
ret 0
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm
to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx
or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m
, cmp r/m, r
, or cmp r, r/m
forms of CMP. I used two alternates that split things up not by opcode but by which side included the possible memory operand. "rme"
is like "g"
(rmi) but limited to 32-bit sign-extended immediates).
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3
does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 5, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs
into a register. (But with an "i"
or "g"
constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax
without setc
would probably have been better, but we can't get that from inline asm without __builtin_constant_p()
on k
.
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1
as an input.