Optimal bit twiddling for the One's complement absolute value operation on modern x86 processors

Question

Computing the absolute value of a two's complement number is a common enough operation that optimized implementations are widely available. So let's consider another case: what if we want the absolute value of a one's complement number, using x86 assembly?

A quick but probably suboptimal branchless implementation that I have is to take the sign bit by ANDing with a 10000.... mask and shifting, multiply that by a 11111... mask, and XOR the result with the original number. But is there a better way to do it?
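For reference, the steps described above can be sketched in C roughly as follows. This is only an illustration for a 32-bit value; the function name `ones_abs_mul` is made up for this sketch and is not from any library.

```c
#include <stdint.h>

/* One's complement absolute value, following the steps in the question.
   The name ones_abs_mul is hypothetical, chosen for this sketch. */
static inline uint32_t ones_abs_mul(uint32_t x)
{
    uint32_t sign = (x & 0x80000000u) >> 31;  /* isolate the sign bit: 0 or 1 */
    uint32_t mask = sign * 0xFFFFFFFFu;       /* 0, or all ones if negative   */
    return x ^ mask;                          /* invert every bit if negative */
}
```

In one's complement, -5 is the bitwise inverse of 5, so `ones_abs_mul(~5u)` yields 5, and the all-ones "negative zero" maps to 0.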

One application where this pops up is an optimal implementation of Gray code decoding. The common implementation of a Gray decode for a 64-bit integer uses six XOR operations and six bit shifts. However, carry-less multiplication of a number by ......1111110 gives either the Gray decoding or its bitwise negation, and taking the one's complement absolute value of that gives the Gray decoding. As long as that can be micro-optimized, it should be faster than the most widespread way of doing it. For the purpose of the question, the starting state can be assumed to be either any standard C calling convention or right after a CLMUL operation (taking the non-carry output).
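To make the idea concrete, here is a small C sketch comparing the two decoders. It is an illustration under stated assumptions, not a tuned implementation: `clmul_lo` is a slow portable stand-in for the low 64 bits of a carry-less multiply (real code would use the CLMUL instruction), and all function names are invented for this sketch. Note also that the absolute value always clears the sign bit, so the CLMUL route only matches the conventional decode when the top bit of the Gray-coded input is zero.

```c
#include <stdint.h>

/* Conventional decode: six shifts and six XORs. */
static inline uint64_t gray_decode_shifts(uint64_t g)
{
    g ^= g >> 32;
    g ^= g >> 16;
    g ^= g >> 8;
    g ^= g >> 4;
    g ^= g >> 2;
    g ^= g >> 1;
    return g;
}

/* Portable stand-in for the low 64 bits of a carry-less multiply
   (hypothetical helper; hardware CLMUL would replace this loop). */
static inline uint64_t clmul_lo(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* One's complement absolute value: XOR with a mask of 0 or all ones. */
static inline uint64_t ones_abs64(uint64_t x)
{
    uint64_t mask = 0 - (x >> 63);   /* 0 if sign bit clear, else all ones */
    return x ^ mask;
}

/* Decode via carry-less multiply by ...1111110, then absolute value.
   Assumes bit 63 of g is zero. */
static inline uint64_t gray_decode_clmul(uint64_t g)
{
    uint64_t p = clmul_lo(g, ~UINT64_C(1));  /* decoding or its negation */
    return ones_abs64(p);
}
```

For example, the Gray code 3 (binary 11) decodes to 2 with either function.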

Shouldn't that be just `x ^ (x >>> N - 1)` where `N` is the number of bits and `>>>` is an arithmetic right shift? In ARM syntax, that would be one instruction: `EOR R0, R0, R0, ASR #31`. – fuz, Jun 27 '22 at 18:56

score 6 · Answer 1 · answered Jun 27 '22 at 18:57

> take the sign bit by ANDING with a 10000.... mask and shifting, multiplying that with a 11111... mask

The sign mask could be computed by just an arithmetic right shift:

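    mov edx, eax  ; keep a copy of the original value
    sar eax, 31   ; <- compute the sign mask (0 or all ones)
    xor eax, edx  ; invert every bit if negative, else leave unchanged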
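The same idea can be written in C; a minimal sketch, assuming the compiler implements `>>` on a negative signed integer as an arithmetic shift (implementation-defined in C, but the usual behavior on x86 compilers). The name `ones_abs32` is made up for this illustration.

```c
#include <stdint.h>

/* One's complement absolute value via an arithmetic right shift.
   Assumes x >> 31 sign-extends (implementation-defined in C). */
static inline int32_t ones_abs32(int32_t x)
{
    int32_t mask = x >> 31;   /* 0 for non-negative x, -1 (all ones) otherwise */
    return x ^ mask;          /* invert every bit only if the sign bit was set */
}
```

With this, `ones_abs32(~5)` gives 5, and the all-ones pattern (one's complement negative zero) gives 0. Compilers targeting x86 would typically be expected to emit a `sar`/`xor` sequence much like the one above, though that depends on the compiler and flags.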

As for decoding Gray codes, there are other tricks that rely on modern instructions.

answered Jun 27 '22 at 18:57

harold


How does that compare to the use of `cdq; xor eax, edx`? – njuffa Jun 27 '22 at 19:54

@njuffa: CDQ or CQO are single-uop peephole optimizations over `mov`/`sar` that are more efficient on all CPUs, if you happen to already have the value in EAX or RAX. On a variety of modern Intel and AMD CPUs (https://uops.info/), including low-power Alder Lake-E, `cdq` has the same latency/throughput/ports as `sar r32, i8`, as you'd expect; it probably just decodes to a no-flags shift uop. So yes, that's a nice optimization for this answer, saving a uop. (But the critical-path latency is the same even without mov-elimination: this answer shifts EAX to keep the `mov` off the critical path on Ice Lake or pre-IvB.) – Peter Cordes Jun 28 '22 at 03:03
