
I have an algorithm for calculating floating-point division using the Newton-Raphson method. My results are not fully accurate and are sometimes off by 1 ulp.

I was wondering if there is a refinement algorithm for floating-point division to get the final bits of accuracy. I use the Tuckerman test for square root, but is there a similar algorithm for division? Or can the Tuckerman test be adapted for division?

I tried using this refinement algorithm too, but didn't get fully accurate results:

z = 1/divisor (approximate reciprocal)
r_temp = divisor * q
r = dividend - r_temp
result_temp = r * z
result = q + result_temp
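
In C, the same step looks like this (refine is just a hypothetical helper name; q is my preliminary quotient and z holds an approximation of 1/divisor):

float refine (float dividend, float divisor, float q, float z)
{
    float r_temp = divisor * q;  /* product is rounded */
    float r = dividend - r_temp; /* residual is rounded again */
    float result_temp = r * z;   /* scaled correction, rounded */
    return q + result_temp;      /* refined quotient */
}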
Veridian
  • There is an algorithm for rounding the result of a floating-point division based on FMA (fused multiply-add), but from your previous questions I gather that is of no use to you, as your platform does not provide FMA? [This technical report](http://www.hpl.hp.com/techreports/93/HPL-93-42.pdf) by Alan H. Karp and Peter Markstein from 1993 states: "As originally formulated, Tuckerman rounding can be used only for square root, not division." – njuffa Feb 21 '15 at 16:44
  • Have you tried a literature search, e.g. via Google Scholar? The following paper may be useful, as it does not assume availability of FMA: [Bogdan Pasca, Correctly Rounded Floating-Point Division for DSP-Enabled FPGAs](http://www.altera.com/technology/dsp/floating-point/floating-point-division-for-dsp-fpga.pdf). – njuffa Feb 22 '15 at 03:07

1 Answer

One practical way of correctly rounding the result of iterative division is to produce a preliminary quotient to within one ulp of the mathematical result, then use the exactly-computed residual to compute the final result.

The tool of choice for the exact computation of residuals is the fused multiply-add (FMA) operation. Much of the foundational work on this approach (both the mathematics and practical implementations) is due to Peter Markstein, and was later refined by other researchers. Markstein's results are nicely summarized in his book:

Peter Markstein, IA-64 and Elementary Functions: Speed and Precision. Prentice-Hall 2000.
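
To make the role of the FMA concrete, here is a minimal sketch (residualf is just an illustrative name): when q approximates a/b to within one ulp, the residual a - b*q is exactly representable in single precision, and one FMA computes it without first rounding the product b*q.

#include <math.h>

/* exact residual of a preliminary quotient q ~= a/b */
float residualf (float a, float b, float q)
{
    return fmaf (-b, q, a); /* single rounding, and the exact value is representable */
}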

A straightforward path to correctly rounded division with Markstein's approach is to first compute a correctly rounded reciprocal, then compute the preliminary quotient by multiplying it by the dividend, and finish with the residual-based rounding step.

The residual can be used to compute the final rounded result directly, which is the technique used by Markstein. I had originally used that sequence for the quotient rounding in the code below, but noticed that it resulted in an incorrectly rounded result in about one out of 10^11 divisions, so I replaced it with another instance of the comparison-and-select idiom. Alternatively, the residual can be used in a two-sided comparison-and-select process somewhat akin to Tuckerman rounding; this is what the code below uses for both the reciprocal rounding and the quotient rounding.
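
For reference, here is a sketch of Markstein's direct rounding step, the variant I replaced (names match the code below, where r holds the rounded reciprocal of mb and q the preliminary quotient):

e = fmaf (-mb, q, ma); /* exact residual of preliminary quotient */
q = fmaf (e, r, q);    /* q + e*(1/mb), rounded once: final quotient */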

There is one caveat with regard to the reciprocal computation. Many commonly used iterative approaches (including the one I used below), when combined with Markstein's rounding technique, deliver an incorrect result if the mantissa of the divisor consists entirely of 1-bits.

One way of getting around this is to treat this case specially. In the code below I instead opted for a two-sided comparison-and-select approach, which also allows errors slightly larger than one ulp prior to rounding and thus eliminates the need to use FMA in the reciprocal iteration itself.
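
If one preferred to special-case those divisors instead, the test is straightforward (a sketch, assuming mb = frexpf (fabsf (b), &ib) as in the code below, so mb lies in [0.5, 1)):

/* an all-1-bits mantissa normalizes to the largest float below 1.0 */
if (mb == nextafterf (1.0f, 0.0f)) {
    /* handle the reciprocal of this divisor separately */
}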

Please note that I omitted the handling of sub-normal results in the C code below to keep the code concise and easy to follow. I have limited myself to standard C library functions for tasks like extracting parts of floating-point operands, assembling floating-point numbers, and applying one-ulp increments and decrements. Most platforms will offer machine-specific options with higher performance for these.

#include <math.h> /* fabsf, frexpf, ldexpf, fmaf, nextafterf, copysignf */

float my_divf (float a, float b)
{
    float q, r, ma, mb, e, s, t;
    int ia, ib;

    if (!isnan (a+b) && !isinf (a) && !isinf (b) && (b != 0.0f)) {
        /* normal cases: remove sign, split args into exponent and mantissa */
        ma = frexpf (fabsf (a), &ia);
        mb = frexpf (fabsf (b), &ib);
        /* minimax polynomial approximation to 1/mb for mb in [0.5,1) */
        r =        - 3.54939341e+0f;
        r = r * mb + 1.06481802e+1f;
        r = r * mb - 1.17573657e+1f;
        r = r * mb + 5.65684575e+0f;
        /* apply one iteration with cubic convergence */
        e = 1.0f - mb * r;
        e = e * e + e;
        r = e * r + r;
        /* round reciprocal to nearest-or-even */
        e = fmaf (-mb, r, 1.0f); // residual of 1st candidate
        s = nextafterf (r, copysignf (2.0f, e)); // bump or dent 
        t = fmaf (-mb, s, 1.0f); // residual of 2nd candidate
        r = (fabsf (e) < fabsf (t)) ? r : s; // candidate with smaller residual
        /* compute preliminary quotient from correctly-rounded reciprocal */
        q = ma * r;
        /* round quotient to nearest-or-even */
        e = fmaf (-mb, q, ma); // residual of 1st candidate
        s = nextafterf (q, copysignf (2.0f, e)); // bump or dent 
        t = fmaf (-mb, s, ma); // residual of 2nd candidate
        q = (fabsf (e) < fabsf (t)) ? q : s; // candidate with smaller residual
        /* scale back into result range */
        r = ldexpf (q, ia - ib);
        if (r < 1.17549435e-38f) {
            /* sub-normal result, left as an exercise for the reader */
        }
        /* merge in sign of quotient */
        r = copysignf (r, a * b);
    } else {
        /* handle special cases */
        if (isnan (a) || isnan (b)) {
            r = a + b;
        } else if (b == 0.0f) {
            r = (a == 0.0f) ? (0.0f / 0.0f) : copysignf (1.0f / 0.0f, a * b);
        } else if (isinf (b)) {
            r = (isinf (a)) ? (0.0f / 0.0f) : copysignf (0.0f, a * b);
        } else {
            r = a * b;
        }
    }
    return r;
}
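
A minimal harness to exercise the routine, using the test operands from the comments below:

#include <stdio.h>

int main (void)
{
    float a =  0x1.f82b68p-1f;  /* bit pattern 0x3F7C15B4 */
    float b = -0x1.a1e606p+61f; /* bit pattern 0xDE50F303 */
    /* correctly rounded quotient has bit pattern 0xA09A6CA1 */
    printf ("my_divf = %a  native = %a\n", my_divf (a, b), a / b);
    return 0;
}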
njuffa
  • Since I don't have an FMA, is it still possible to get the correct result, using round-to-zero or something similar in the process? – Veridian Feb 24 '15 at 01:04
  • @starbox: You might want to take a look at the paper I pointed to earlier and see whether that approach works for you. I am not sure what your context is: If you need this functionality for production software or hardware under the constraints of your specific platform [which you have not explained in detail], you may want to look into hiring a consultant. If you just want to learn the various algorithms, I would suggest pulling papers and attempting implementation. Working through the details is a sure way of acquiring a solid grasp of relevant techniques. – njuffa Feb 24 '15 at 01:53
  • Thank you, yes I have a copy of that book with me right here. I'm creating custom hardware that doesn't have an FMA but needs full accuracy, so I was wondering whether I can remove the FMA instructions from your code and use round-to-zero to get full accuracy. I'll try experimenting. – Veridian Feb 24 '15 at 02:07
  • @starbox: Various microprocessors have used iterative approximations for division (e.g. Goldschmidt's algorithm) and achieved correct rounding. This requires some number of additional bits to be able to round correctly. See [this paper](http://www.acsel-lab.com/arithmetic/arith14/papers/ARITH14_Oberman.pdf) about division in the Athlon processor, for example. From your past questions I recall you are not in favor of additional bits. I do not know of a solution for your current issue that requires neither FMA nor additional bits, short of emulating the process in integer arithmetic. – njuffa Feb 24 '15 at 02:16
  • I tried your program using these two inputs and got the wrong answer: dividend = 0x3F7C15B4; divisor = 0xDE50F303. The answer came out as 0xA09A6CA2 when it should be 0xA09A6CA1. Any ideas what might be wrong? – Veridian Mar 07 '15 at 01:12
  • I see this: `a=3f7c15b4 b=de50f303 res=a09a6ca1 ref=a09a6ca1`. What compiler did you use to build the code, on which platform? The compiler must support strict IEEE-754 compliance, and the compiler switches need to be set accordingly. For the record, I used the Intel compiler with the "strict" floating-point settings. I will double-check that the code I posted is identical to what I am running, but since I used copy&paste I assume it is. To narrow down the issue, maybe you can post the value of 'r' after the polynomial, after the cubic iteration, and directly before the computation of the initial 'q'? – njuffa Mar 07 '15 at 02:05
  • Here are the intermediate values I am seeing for your test case: `r after poly = 1.22416258 r after iter = 1.22517776 r before q = 1.22517776`. – njuffa Mar 07 '15 at 02:10
  • `g++ -Wall -std=gnu++0x div.cpp` is how I'm compiling; I'm using an Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz. – Veridian Mar 07 '15 at 02:21
  • In hex and decimal: r after the polynomial: 0x3F9CB15C (1.22416258); r after the one iteration: 0x3F9CD2A0 (1.22517776); r before q: 0x3F9CD2A0 (1.22517776). – Veridian Mar 07 '15 at 02:38
  • Your CPU doesn't have FMA in hardware. There could be a problem with emulation of `fmaf()` in glibc. More likely is that your code is not compiled to use *only* single-precision operations, leading to issues with double-rounding when converting from higher-precision intermediates. Check whether there are x87 instructions in the generated binary code, and if so force the generation of SSE code. I think `-msse` should do the trick but I have not used gcc in a while. Use of `-ffloat-store` might also help, or reducing optimizations to `-O1`. I don't know whether there is a "strict IEEE fp" switch – njuffa Mar 07 '15 at 02:58
  • `initial q = 1.20644009 final q = 1.20644009 r after ldexpf = 2.61604994e-019 final r = -2.61604994e-019` – njuffa Mar 07 '15 at 03:03
  • Similarly, here is my output: initial q: 1.20644009, e: 5.96046448E-08, s: 1.20644021, t: -5.96046448E-08, q: 1.20644021. Note, the s, t, and q here are just from print statements after each line, so s corresponds to the line after the second "bump or dent" comment. – Veridian Mar 09 '15 at 05:34
  • `initial q = 1.20644009 e = 3.16196989e-08 s = 1.20644021 t = -6.56798846e-08 final q = 1.20644009`. Looks like `fmaf()` is not working properly on your platform. I double-checked `e` by using my own `fmaf()` emulation and it matches the data above. – njuffa Mar 09 '15 at 07:39