I'm looking for a way to force my compilers (GCC for ARM and Intel ICL for Intel/AMD) to emit rcp(rsqrt(x)), which estimates a square root, instead of sqrt(x), but only in specific locations (there are other places where I do need the normal precision). GCC has a "-mrecip" option, but unfortunately that adds a Newton-Raphson step to improve the precision, which makes it almost as heavy as a normal sqrt. For my use case, the raw output of rcp(rsqrt(x)) is good enough.
In some places I'm using hand-written vector intrinsics, and there it's simple (I just use the corresponding intrinsics). The following code works for non-vectorized loops:
#include <immintrin.h>

// Approximate sqrt(f) as rcp(rsqrt(f)), with no Newton-Raphson refinement.
__forceinline float fast_sqrt(const float f)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(_mm_set_ss(f))));
}
That emits these 2 instructions:
rsqrtss xmm0, xmm0
rcpss xmm0, xmm0
(Godbolt link: https://godbolt.org/z/hbdrn9deE)
The problem is that I have a number of loops that the compiler could easily vectorize, but calling fast_sqrt inside them blocks auto-vectorization. You can see this in the Godbolt link: if you replace the fast_sqrt call with sqrt, the loop is vectorized immediately.
Of course I could manually vectorize them... but that leads to a lot of code overhead that I really don't want to have. So I wonder if there's a way to force the compiler to emit rsqrtss (and rsqrtps where vectorization is possible).