Does ARM frsqrts need to be used with extra fmul instructions for a Newton iteration?

Question

In the documentation for the ARM instruction frsqrts, it says:

This instruction multiplies corresponding floating-point values in the vectors of the two source SIMD and FP registers, subtracts each of the products from 3.0, divides these results by 2.0, places the results into a vector, and writes the vector to the destination SIMD and FP register.

I interpret this as yₙ₊₁ = (3 - xyₙ)/2, and indeed the following code justifies this interpretation:

.global _main
.align 2
_main:
    fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
    fmov d1, #0.5 // initial guess
    frsqrts d2, d0, d1 // first approx

    mov x0, #0  // exit status 0
    mov x16, #1 // '1' = terminate syscall
    svc #0x80   // "supervisor call"
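
To sanity-check the interpretation without hardware, the per-lane arithmetic described in the documentation can be emulated in plain Python. This is only a sketch: `frsqrts_lanes` is a hypothetical helper name, not an ARM intrinsic, and it models neither the rounding behaviour nor the special-case handling (zeros, infinities, NaNs) of the real instruction.

```python
def frsqrts_lanes(a, b):
    """Emulate the documented per-lane arithmetic: (3.0 - a[i] * b[i]) / 2.0.

    Hypothetical helper for illustration only; rounding and special values
    of the real instruction are not modeled.
    """
    if len(a) != len(b):
        raise ValueError("source vectors must have the same number of lanes")
    return [(3.0 - x * y) / 2.0 for x, y in zip(a, b)]


# Scalar case from the listing above: d0 = 2.0, d1 = 0.5.
d2 = frsqrts_lanes([2.0], [0.5])[0]
print(d2)  # 1.0, which is (3 - 2.0 * 0.5) / 2
```

With these inputs the emulation gives 1.0, which matches the formula (3 - xyₙ)/2.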

However, reading about the Newton iteration for the inverse square root, I see that the iteration is not yₙ₊₁ = (3 - xyₙ)/2, but rather yₙ₊₁ = yₙ(3 - xyₙ²)/2. Now, obviously I can use frsqrts in combination with other instructions to get this:

    fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
    fmov d1, #0.5 // initial guess
    fmul d2, d1, d1 // initial guess squared
    frsqrts d3, d0, d2 // (3-r*r*x)/2
    fmul d4, d1, d3 // d4 = r*(3-r*r*x)/2
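
The three-instruction sequence above can be traced in plain Python to see whether repeating it approaches 1/sqrt(2). This is a minimal sketch, assuming the step is exactly (3 - a*b)/2 as documented; `frsqrts` and `newton_rsqrt_step` are hypothetical helper names, and the rounding behaviour of the real instruction is not modeled.

```python
import math


def frsqrts(a, b):
    # Emulates the documented arithmetic only: (3 - a*b) / 2.
    return (3.0 - a * b) / 2.0


def newton_rsqrt_step(x, r):
    # Mirrors the listing: fmul (r*r), frsqrts, fmul (r * correction).
    r_squared = r * r                    # fmul d2, d1, d1
    correction = frsqrts(x, r_squared)   # frsqrts d3, d0, d2
    return r * correction                # fmul d4, d1, d3


x = 2.0
r = 0.5  # same initial guess as the listing
for i in range(5):
    r = newton_rsqrt_step(x, r)
    # Columns: iteration number, current estimate, absolute error.
    print(i + 1, r, abs(r - 1.0 / math.sqrt(x)))
```

The first step yields 0.625, the value the listing leaves in d4, and the error column shrinks roughly quadratically from one iteration to the next.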

But it seems weird to introduce a custom instruction which only gets you halfway to your goal. Am I misusing this instruction?

You'd think they might have an application note with sample code, but I couldn't find one either. — Nate Eldredge, Aug 13 '23 at 02:08
"But it seems weird to introduce a custom instruction which only gets you halfway to your goal." But remember, this is a RISC machine. It's reasonable to have a custom instruction here because it's just a glorified fused multiply-subtract, which the machine already has dedicated hardware for (e.g. for `fmsub`). The only differences are that one operand is hardcoded as 3 to save you a register, and the bonus feature of dividing by two at the end. The latter just means decrementing the exponent, and there's hardware for that too, for instructions like `shadd`. — Nate Eldredge, Aug 13 '23 at 02:15
By contrast, there's presumably no special hardware to fuse *two* multiplications, so if there was an instruction to do the whole thing, it would just have to be microcoded, and wouldn't be any faster than doing the first multiply as a separate instruction. All you'd gain is code density, and it isn't the RISC way to achieve that through microcode. — Nate Eldredge, Aug 13 '23 at 02:18
