
I'm looking for a way to force my compilers (GCC for ARM, and Intel ICL for Intel/AMD) to emit rcp(rsqrt(x)), which estimates a square root, instead of sqrt(x) in specific places (elsewhere I do need the normal precision). GCC has the "-mrecip" option (in combination with "-ffast-math"), but unfortunately that adds a Newton-Raphson step to improve the precision, which makes the sequence almost as expensive as a real sqrt. For what I'm using it for, the raw output of rcp(rsqrt(x)) is accurate enough.

In some places I'm using hand-written vector intrinsics, and there it's simple (I just use the estimate intrinsics directly). For non-vectorized loops, the following code works:

#include <xmmintrin.h>  // SSE intrinsics

__forceinline float fast_sqrt(const float f)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(_mm_set_ss(f))));
}

That compiles to just these two instructions:

        rsqrtss   xmm0, xmm0
        rcpss     xmm0, xmm0

(Godbolt link: https://godbolt.org/z/hbdrn9deE)
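To quantify "good enough": both rsqrtss and rcpss are specified with a maximum relative error of 1.5 * 2^-12, so the composed estimate stays within roughly 0.1% of the true square root, and it maps 0 to 0. A self-contained sketch of the helper (redefined here under a hypothetical name so the snippet compiles on its own) for checking that:

```c
#include <math.h>
#include <xmmintrin.h>

/* Same approximation as fast_sqrt above, repeated so this snippet is
   self-contained. rsqrtss and rcpss each have relative error
   <= 1.5 * 2^-12, so the composition is within ~0.1% of sqrtf(x).
   For x == 0: rsqrt(0) = +INF and rcp(+INF) = 0, so 0 maps to 0. */
static float fast_sqrt_est(const float f)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(_mm_set_ss(f))));
}
```

(Checked against sqrtf over a range of inputs; the worst relative error I see is well under the 0.1% bound.)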

The problem is that I have a number of loops the compiler could easily auto-vectorize, but using these intrinsics blocks that. You can see it in the Godbolt link: if you replace the fast_sqrt call with sqrt, the loop vectorizes immediately.

Of course I could vectorize them manually... but that leads to a lot of code overhead that I really don't want to have. So I wonder if there's a way to force the compiler to emit rcpss/rsqrtss (and the packed rcpps/rsqrtps where vectorization is possible).
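For context, manually vectorizing one of those loops looks roughly like this (a sketch with hypothetical names; assumes out and in don't alias):

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hand-vectorized version of what I'd like the compiler to generate:
   rsqrtps + rcpps on 4 floats at a time, with a scalar tail loop. */
static void fast_sqrt_array(float *out, const float *in, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_rcp_ps(_mm_rsqrt_ps(v)));
    }
    for (; i < n; i++) {  /* leftover elements, one at a time */
        __m128 v = _mm_set_ss(in[i]);
        out[i] = _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(v)));
    }
}
```

Multiply that boilerplate by every affected loop and it's exactly the overhead I'd rather avoid.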

asked by hvz
  • Is there a particular reason you compute `1/pow(f, -0.5)` instead of `pow(f, -0.5)*f`? – EOF Aug 24 '21 at 22:30
  • @EOF: I actually did that first, and it caused issues. Not only does it take (slightly) more CPU cycles, but worse, it fails for input 0. The trick with rcp(rsqrt(f)) is that if f is 0, rsqrt(f) returns INF, and rcp(INF) is 0. However, 0 * INF is NaN, so the multiply version breaks at 0. – hvz Aug 24 '21 at 22:49
  • Good point. I guess in AVX512 you'd check for `<= 0` into a mask register and zero-merge the `VRCP14PS` and vector multiply? – EOF Aug 24 '21 at 22:56
  • Not sure about AVX512 (I did read somewhere that rcp is slower in AVX because it can only do rcp on 4 values at the same time, not 8. So it might be even slower in AVX512.). At least for SSE, the rcp of rsqrt is the fastest way of doing it. For normal AVX I think that's still true, SSE rcp takes 3 cycles, if AVX rcp takes 6 that's still less than a mul (4 cycles) + a compare + an and (I don't know the number of cycles for those). – hvz Aug 24 '21 at 23:07
  • On Skylake-AVX512 / Cascade Lake, `VRCP14PS zmm` is 2p0 + p5, with 2c throughput, vs. 1p0 with 1c throughput for the YMM version. ([search on uops.info](https://uops.info/table.html?search=rcp%20ps%20ymm\)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_SKX=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx512=on&cb_others=on)). YMM performance is the same with AVX 11-12-bit precision vs. AVX-512 14-bit precision. So yes, the fast-reciprocal unit it only half-width. But even on Skylake-client, it's half of 512-bit, unlike Haswell. And `vsqrtps` is 6c throughput for YMM, 12c for ZMM. – Peter Cordes Aug 24 '21 at 23:15
  • The *latency* of the instructions doesn't matter for this, the *throughput* does. In Agner Fog's instruction tables (taking the example of IceLake), I can see 128/256-bit reciprocal and reciprocal sqrt instructions running on the same port, whereas multiply can issue on a different port as well, and at least one of the floating-point comparison instructions can issue on a third unrelated port. I'd expect AVX512-instructions with 256-bit width and multiply instead of reciprocal to run faster. – EOF Aug 24 '21 at 23:18
  • Note that if this is in a loop with lots of other FP FMA/add/mul work so you only need a SIMD sqrt every 6 or 12 cycles, the not-fully-pipelined FP div/sqrt unit won't be a bottleneck so you can just use it, costing fewer front-end uops and similar latency vs. a chain of rsqrt / rcp / mul. ([Floating point division vs floating point multiplication](https://stackoverflow.com/a/45899202) collects up div and sqrt throughput numbers from recent uarches up to Skylake and Zen1) – Peter Cordes Aug 24 '21 at 23:19
  • Disabling strict IEEE 754 semantics with `-ffast-math` could do the trick. – Jake 'Alquimista' LEE Aug 25 '21 at 06:19
  • @Jake'Alquimista'LEE: No, it either keeps the sqrt, or replaces it with an rsqrt plus a Newton-Raphson step, which isn't faster. – hvz Aug 25 '21 at 19:07
  • @hvz Very often assembly is the only way. That's why I don't bother with compilers at all when dealing with performance sensitive routines. – Jake 'Alquimista' LEE Aug 26 '21 at 08:15
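The compare-and-mask idea EOF suggests could be sketched in plain SSE like this (hypothetical, not benchmarked; an AVX-512 version would use a mask register with zero-masking on VRCP14PS instead, as discussed above):

```c
#include <xmmintrin.h>

/* sqrt(x) ~= x * rsqrt(x), with an explicit compare-and-mask so that
   lanes with x <= 0 produce 0 instead of the NaN that 0 * INF gives. */
static __m128 fast_sqrt_ps_masked(__m128 v)
{
    __m128 nonzero = _mm_cmpgt_ps(v, _mm_setzero_ps()); /* all-ones where v > 0 */
    __m128 approx  = _mm_mul_ps(v, _mm_rsqrt_ps(v));    /* x * (1/sqrt(x)) */
    return _mm_and_ps(nonzero, approx);                 /* zero out bad lanes */
}
```

Whether this beats rcpps(rsqrtps(x)) depends on port pressure, per the throughput discussion in the comments.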

0 Answers