Consider this example, in which several rounding operations (round-down, round-up, round-toward-zero and round-to-nearest-with-ties-to-even) can each be expressed with a single roundsd instruction:
use_floor(double):
    roundsd xmm0, xmm0, 9
    ret
use_ceil(double):
    roundsd xmm0, xmm0, 10
    ret
use_trunc(double):
    roundsd xmm0, xmm0, 11
    ret
use_nearby(double):
    roundsd xmm0, xmm0, 12
    ret
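For reference, the listing above is presumably what a compiler emits for thin wrappers like the following. This is a reconstruction, not the original source: the function names are taken from the assembly labels, and the mapping to the `<cmath>` functions is an assumption based on those names.

```cpp
#include <cmath>

// Assumed source for the four single-instruction cases; only the
// function names are known from the assembly labels.
double use_floor(double x)  { return std::floor(x); }      // round down
double use_ceil(double x)   { return std::ceil(x); }       // round up
double use_trunc(double x)  { return std::trunc(x); }      // round toward zero
double use_nearby(double x) { return std::nearbyint(x); }  // current rounding mode
```

Note that `std::nearbyint` follows the current floating-point rounding mode, which is round-to-nearest-with-ties-to-even unless the program changes it.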
Round-to-nearest-with-ties-away-from-zero, however, requires additional instructions:
use_round(double):
    movapd xmm1, xmm0
    andpd xmm0, XMMWORD PTR .LC1[rip]
    orpd xmm0, XMMWORD PTR .LC0[rip]
    addsd xmm0, xmm1
    roundsd xmm0, xmm0, 3
    ret
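One plausible reading of this sequence, written out in C++, is sketched below. It rests on assumptions, because the listing does not show the constants: `.LC1` is taken to be a mask that keeps only the sign bit, and `.LC0` the bit pattern of the largest double below 0.5. Under that reading, andpd/orpd build a copy of that constant carrying the sign of the input, addsd adds it to the input, and the final roundsd (rounding control 3, toward zero) truncates the sum. The helper name `round_via_trunc` is hypothetical.

```cpp
#include <cmath>

// Assumed source function behind the use_round listing.
double use_round(double x) { return std::round(x); }

// Sketch of what the instruction sequence appears to compute, assuming the
// default round-to-nearest-even mode for the addition.
double round_via_trunc(double x) {
    // Assumed value of .LC0: the largest double strictly below 0.5.
    const double pred_half = std::nextafter(0.5, 0.0);
    // andpd with a sign mask, then orpd with the constant: copy x's sign.
    const double offset = std::copysign(pred_half, x);
    // addsd followed by roundsd with truncation.
    return std::trunc(x + offset);
}
```

If this reading is right, the constant sits just below 0.5 rather than at 0.5 so that an input such as 0.49999999999999994 does not get pushed up to 1.0 by the rounding of the addition itself, while exact ties such as 0.5 or 2.5 still reach the next integer in magnitude before truncation.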
Why does this rounding mode require more instructions on x86 (unlike on Arm), and how do these bit operations on a floating-point value end up implementing the desired semantics?