Title question: _mm_mulhrs_epi16
(pmulhrsw
) is I think intended for 16-bit fixed-point multiply with averaging.
sqrt
: x86 doesn't have any integer square root support, but the x87 fsqrt
instruction does handle 80-bit long double
with a 64-bit mantissa. (i.e. sqrtl()
in C). But only one at a time (scalar), with even worse throughput than sqrtpd
. That will also cost store/reload latency getting data in/out of x87 registers, even though fild
/ fistp
can convert from/to int64_t
with round-to-nearest, if you can get a C compiler to emit those.
e.g. on GNU/Linux (or other non-Windows platforms where long double
is the 80-bit x87 type), I think this might be viable:
#include <stdint.h>
#include <math.h>
int64_t fixed_point_sqrt(int64_t a) {
return lrintl(sqrtl(a) * (1LL<<32)); // rescale for your fixed point range
}
(lrintl does long double
-> long
conversion with the current default rounding mode, i.e. round to nearest. Otherwise just casting, you can get SSE3 fisttp
truncation, or without SSE3 a slow change of the rounding mode to truncation and back.)
With GCC and clang (targeting Linux), you get (on Godbolt)
# gcc and clang -O3 -fno-math-errno are both similar; this is clang:
fixed_point_sqrt(long): # @fixed_point_sqrt(long)
mov qword ptr [rsp - 16], rdi
fild qword ptr [rsp - 16] # convert int64 -> 80-bit x87
fsqrt
fmul dword ptr [rip + .LCPI1_0] # float 4.2949673E+9 is exactly representable
fistp qword ptr [rsp - 8] # convert back with roundinging
mov rax, qword ptr [rsp - 8]
ret
Actually fisttp vs. fistp might not matter if the 80-bit FP value is always a whole integer anyway; not sure how the range works out there.
The normal use-case for fixed-point is for narrow elements where you don't want to waste space on an exponent field; your case of very limited dynamic range (largest value only twice the smallest) does make the exponent field redundant for FP values, but modern x86 CPUs spend a lot of transistors on SIMD-FP throughput so that's still a good bet for high performance.
If you care about maximum precision, note that the integer square root of a 64-bit integer only has 32 significant bits. But for fixed-point, the square roots of numbers between 0.5 and 1.0 are between .75 and 1.0, so you're only losing 1 bit of precision (MSB always set in the result). So the rescaling makes it different from pure integer sqrt.
If you need more mantissa bits, you can use double-double
(https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic) pairs of double
, double-double-arithmetic. Multiplying and adding such numbers is possible with SIMD, but IDK how efficiently one could implement sqrt.