
For some reason _mm256_rcp_pd is not in AVX or AVX2.

In AVX512 we got _mm256_rcp14_pd.

Is there a way to get a fast approximate reciprocal in double precision on AVX2? Are we supposed to convert to single precision and then back?

Unlikus
  • You’re supposed to divide numbers with `_mm256_div_pd`. These approximate instructions, `rcpps` and `rsqrtps`, are only there because on early SSE processors like the Pentium 3, division and square root were very slow, around 40-60 cycles. They have become much faster since then; for instance, on AMD Zen 3 `vdivpd` takes ≤13 cycles. If you downcast your vector to FP32, use `_mm_rcp_ps`, then upcast back (see the sketch after these comments), those 3 instructions take 6+3+4=13 cycles, so there is no reason to do that. – Soonts Mar 03 '23 at 17:09
  • @Soonts: "no reason" except maybe throughput instead of latency; conversion and `vrcpps` are fully pipelined. But `vdivpd` is also partially pipelined on recent CPUs, with throughput better than latency, although not 1/clock. If you're considering something low-precision like this, it probably means your data should have been `float` in the first place; at least the result, so you only have to convert *to* float, not back to double. – Peter Cordes Mar 03 '23 at 18:55
  • Casting down to `float` can cause over- or underflows. So for a good simulation of `rcp14_pd`, you'd also have to manually calculate the exponent -- unless you know that your input range is limited. – chtz Mar 03 '23 at 20:08
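
For reference, here is a minimal sketch of the down-convert / `rcpps` / up-convert route discussed in the comments. The function name is made up for this illustration; it assumes AVX + FMA, and inputs that stay within `float`'s exponent range (per the over-/underflow caveat above):

#include <immintrin.h>

// Sketch only: convert down to float, take the ~12-bit rcpps estimate,
// convert back up, then do one Newton-Raphson step in double precision.
__m256d approx_recip_via_float(__m256d y)
{
    __m128  yf = _mm256_cvtpd_ps(y);   // 4 doubles -> 4 floats
    __m128  rf = _mm_rcp_ps(yf);       // approximate reciprocal, ~12 bits
    __m256d x  = _mm256_cvtps_pd(rf);  // back to 4 doubles
    // one refinement step, x = x*(2 - x*y), roughly doubles the correct bits
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));
    return x;
}

The `_mm_rcp_ps` estimate alone is only good to about 12 bits, so the refinement step in double precision is what makes the result usable.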

1 Answer


With some integer-cast hacking and a Newton–Raphson refinement step, you can get a somewhat accurate approximation with 3 uops. Latency is probably not great, since this mixes integer and double operations, but it should still be much better than `vdivpd`. This solution also assumes that all inputs are normalized doubles.

#include <immintrin.h>  // AVX2 integer ops + FMA intrinsics

__m256d fastinv(__m256d y)
{
    // magic constant chosen so that results are exact for powers of two
    __m256i const magic = _mm256_set1_epi64x(0x7fe0'0000'0000'0000);
    // Bit-magic: for powers of two this just negates the exponent (2^k -> 2^-k),
    // and values in between are linearly interpolated
    __m256d x = _mm256_castsi256_pd(_mm256_sub_epi64(magic, _mm256_castpd_si256(y)));

    // Newton-Raphson refinement: x = x*(2.0 - x*y):
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));

    return x;
}

With the constants above, the inverse is exact for powers of two, but the relative error grows to roughly 1.5% in between, worst for mantissas around 1.5.
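
If you want to check the error yourself, sweeping one binade is enough, because the relative error of this approximation only depends on the mantissa. A hypothetical test driver (assuming the `fastinv` above is defined earlier in the same file, compiled with `-mavx2 -mfma`):

#include <cmath>
#include <cstdio>
#include <immintrin.h>

// measures the worst relative error of fastinv() over one binade
int main()
{
    double worst_err = 0.0, worst_y = 0.0;
    // sweep [1,2); the error repeats for every power-of-two scale of the input
    for (double y = 1.0; y < 2.0; y += 1.0 / 65536) {
        double x[4];
        _mm256_storeu_pd(x, fastinv(_mm256_set1_pd(y)));
        double rel_err = std::fabs(x[0] * y - 1.0);  // |approx - 1/y| / (1/y)
        if (rel_err > worst_err) { worst_err = rel_err; worst_y = y; }
    }
    std::printf("max relative error %.4f%% at y = %f\n", 100.0 * worst_err, worst_y);
}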

If you fine-tune the magic constant as well as the `2.0` constant, or add another NR step, you can increase the accuracy further; see the sketch below.
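
For example, a hypothetical two-step variant (one extra FMA and multiply, so 5 uops instead of 3) could look like this; each Newton-Raphson step roughly squares the remaining relative error, so with the untuned seed above the worst case drops from about 1/64 to about 1/4096, ignoring rounding:

// Sketch only: same bit-hack seed as fastinv() above (same <immintrin.h>),
// but with two refinement steps
__m256d fastinv2(__m256d y)
{
    __m256i const magic = _mm256_set1_epi64x(0x7fe0'0000'0000'0000);
    __m256d x = _mm256_castsi256_pd(_mm256_sub_epi64(magic, _mm256_castpd_si256(y)));

    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));  // first NR step
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));  // second NR step
    return x;
}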

Godbolt link: https://godbolt.org/z/f7YhnhT96

chtz
  • I did not expect that, but it seems your version is about 8 times faster: https://quick-bench.com/q/_guRGzj2vVf1ha0J8J3ALJfJSr4 – Soonts Mar 03 '23 at 23:35