
I'm following this paper by Andrew Thall, which describes algorithms for performing math using the df64 data type, a pair of 32-bit floating point numbers used to emulate the precision of a 64-bit floating point number. However, there appear to be some inconsistencies (mistakes?) in how its Division and Square Root functions are written.

This is how the Division function is written in the paper:

float2 df64_div(float2 B, float2 A) {
    float xn = 1.0f / A.x;
    float yn = B.x * xn;
    float diff = (df64_diff(B, df64_mult(A, yn))).x;
    float2 prod = twoProd(xn, diffTerm);

    return df64_add(yn, prodTerm);
}

The language used to write this code appears to be Cg. For reference, you should be able to interpret it as C++ if you treat float2 as merely an alias for struct float2 { float x, y; };, with some extra syntax to support arithmetic operations directly on the type.
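
For concreteness, this is roughly the C++ shim I have in mind for float2. It is my own sketch of the minimum surface area the snippets below need, not something taken from the paper:

// Minimal C++ stand-in for Cg's float2 (an assumed shim, not part of the paper).
struct float2 {
    float x, y;
    float2(float x_ = 0.0f, float y_ = 0.0f) : x(x_), y(y_) {}
};

// Cg's built-in vector operators are component-wise; only a couple are sketched here.
inline float2 operator+(float2 a, float2 b) { return float2(a.x + b.x, a.y + b.y); }
inline float2 operator-(float2 a) { return float2(-a.x, -a.y); }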

For reference, these are the headers of the functions being used in this code:

float2 df64_add(float2 a, float2 b);
float2 df64_mult(float2 a, float2 b);
float2 df64_diff(/*Not provided...*/);
float2 twoProd(float a, float b);

So a couple of problems immediately stand out:

  • diffTerm and prodTerm are never defined. There are two variables, diff and prod, which are defined, but it's not certain that these are the terms intended in this code.
  • No declaration of df64_diff is provided. Presumably it is meant to perform subtraction, but again, this is not clear.
  • df64_mult is a function that does not accept a 32-bit float as an argument; it only accepts two pairs of 32-bit floats. It is not clear how the paper expects this call to compile.
  • The same goes for df64_add, which also only accepts pairs of 32-bit floats as arguments, but is here invoked with a single 32-bit float as its first argument.

Below is my educated guess at what this code was intended to be. But because even a correct implementation of this function incurs unavoidable error in the computation, I can't tell whether my version is right merely because it produces values that "seem" correct:

float2 df64_div(float2 B, float2 A) {
    float xn = 1.0f / A.x;
    float yn = B.x * xn;
    float diff = (df64_diff(B, df64_mult(A, float2(yn, 0)))).x;
    float2 prod = twoProd(xn, diff);

    return df64_add(float2(yn, 0), prod);
}

float2 df64_diff(float2 a, float2 b) {
    return df64_add(a, float2(-b.x, -b.y));
}

So my question is this: is the implementation of this algorithm as written in the paper accurate (perhaps because it depends on behavior of the Cg language that I'm not aware of?), or isn't it? And regardless, is my interpretation of that code a correct implementation of the division algorithm described in the paper?

Note: my target language is C++, so although the differences between the languages (for this kind of algorithm) are minor, my code is written in C++ and I'm looking for correctness for the C++ language.

Xirema
  • It looks plausible enough. What does your testing tell you? I assume you have a 64-bit double to test against? Sadly, of course, 2 x 32-bit floats only give you 48 bits precision, compared to 53 bits in 64-bit double :-( – Chris Hall Feb 26 '20 at 18:15
  • @ChrisHall My testing shows that some rudimentary division works correctly, but like I said: the calculation involves some degree of error by the intrinsic nature of the calculation (see your observation that this type has fewer bits of precision than an actual IEEE754 64-bit float). So I need a more rigorous proof that my implementation is correct (or at least faithfully reproducing the algorithm described in the paper) than merely observing that a few test cases fall within a plausible error range of the expected value. – Xirema Feb 26 '20 at 18:39
  • "Hacker's Delight" (Henry S Warren Jr.) covers Multiword Division for integers, and what you have is consistent with that. However, the proof that the integer version produces exactly the right result makes my head hurt... and I don't know how to extend that to this case. Sorry. I would test lots of randomly generated mantissas, and check that the results are the same as 64-bit double divisions when rounded to 48 bits. I'd try many divisions with the same MS halves of the arguments, but different LS halves. Checking for correct exponents and signs I'd say requires fewer test cases. – Chris Hall Feb 26 '20 at 20:42
  • @ChrisHall If anyone could look into that, that might be a good answer to provide to this question to support the algorithm as written. A Multiword Division algorithm would probably have most of the same properties as this one does. – Xirema Feb 26 '20 at 22:58
  • 1
    @ChrisHall The double-float format provides 49 bits of precision. The sign bit of the tail provides the "additional" bit. As a corollary, any IEEE-754 `binary64` operand with at least four trailing zero bits in the mantissa (significand) is exactly representable as a double-float when `float ` is mapped to IEEE-754 `binary32` (modulo restrictions on the magnitude obviously due to more limited exponent range and potential of sub-normals to affect the tail of the double-float). – njuffa Feb 27 '20 at 23:47

2 Answers


Reviewing the pseudocode algorithm as written in the paper appears to support the C++ implementation of this algorithm, although my unfamiliarity with Cg means I can't prove that the paper's implementation is correct for Cg.

[Image: Algorithm as described in the paper]

So, breaking down these steps into plain English:

  1. The function takes two parameters, each of which is a [pseudo-]double precision floating point value, and where the second parameter is not equal to 0
  2. The variable xn is assigned the arithmetic reciprocal of the higher order component of the [pseudo-]double divisor, calculated using single precision floating point math
  3. The variable yn is assigned the product of the higher order component of the [pseudo-]double dividend and xn, calculated using single precision floating point math
  4. The product of the [pseudo-]double Divisor and yn is calculated
    • This is the first tricky part, because the paper doesn't describe an algorithm for [pseudo-]double x single multiplication. The Cg code clearly maps to this step 1-to-1, but the Cg rules for promoting a scalar value to a vector value are unknown.
    • What we can say, however, is that we do have a function for multiplying a double by a double, and a single can be promoted to a double by padding its lower order component with 0, so we can do that.
  5. The difference between the Dividend and the product calculated in step 4 is calculated, and only the higher order component is kept as a single-precision floating point value
    • What makes this tricky is that the paper doesn't describe an algorithm for subtraction. However, it does describe an algorithm for converting a [IEEE754-]double into a [pseudo-]double, and an observation we can make is that negative [IEEE754-]doubles, when converted, have negative values for both its higher order and lower order components. So logically, a [pseudo-]double can be negated by simply negating both of its components. And a negated number added is mathematically equivalent to subtraction, so we can build a subtraction algorithm using this knowledge.
  6. The product of xn and the result of step 5 is calculated, preserving the extended precision that would otherwise be lost in a single x single multiplication.
    • The twoProd function exists for exactly this purpose (a sketch of it and the other helper functions appears after the code below).
  7. The sum of step 6 and yn is calculated
    • Again, we can use the [pseudo-]double addition algorithm if we simply promote yn to a [pseudo-]double by padding the lower order component with 0
  8. The result of step 7 is the returned value

Understanding this algorithm, we can map each of these steps directly onto the C++ code I wrote:

//(1) Takes two [pseudo-]doubles, returns a [pseudo-]double
float2 df64_div(float2 B, float2 A) {
    //(2) single float divided by single float
    float xn = 1.0f / A.x;
    // (3) single float multiplied by single float
    float yn = B.x * xn;
    //                        (4) double x double multiplication
    //                                       (4a) yn promoted to [pseudo-]double
    //            (5) subtraction                           (5a) only higher order component kept
    float diff = (df64_diff(B, df64_mult(A, float2(yn, 0)))).x;
    // (6) single x single multiplication with extra precision preserved using twoProd
    float2 prod = twoProd(xn, diff);
    // (7) adding the higher-order quotient to the lower-order correction
    //              (7a) yn promoted to [pseudo-]double
    // (8) value is returned
    return df64_add(float2(yn, 0), prod);
}

float2 df64_diff(float2 a, float2 b) {
    //                 (5a) negating both components is a logical negation of the whole number
    return df64_add(a, float2(-b.x, -b.y));
}
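
For completeness, here are minimal C++ sketches of the helper functions used above. These are my own renditions of the standard double-float building blocks rather than verbatim copies of the paper's Cg code; in particular, twoProd is written here with fmaf (requiring <cmath>), whereas the paper constructs the product error with a split-based approach:

// Error-free transformation of a sum (Knuth's TwoSum).
float2 twoSum(float a, float b) {
    float s = a + b;
    float v = s - a;
    float e = (a - (s - v)) + (b - v);
    return float2(s, e);
}

// Faster variant; assumes |a| >= |b|.
float2 quickTwoSum(float a, float b) {
    float s = a + b;
    float e = b - (s - a);
    return float2(s, e);
}

// Error-free product, computed here via fused multiply-add.
float2 twoProd(float a, float b) {
    float p = a * b;
    float e = fmaf(a, b, -p);
    return float2(p, e);
}

// Full-accuracy [pseudo-]double addition.
float2 df64_add(float2 a, float2 b) {
    float2 s = twoSum(a.x, b.x);
    float2 t = twoSum(a.y, b.y);
    s.y += t.x;
    s = quickTwoSum(s.x, s.y);
    s.y += t.y;
    return quickTwoSum(s.x, s.y);
}

// [pseudo-]double multiplication.
float2 df64_mult(float2 a, float2 b) {
    float2 p = twoProd(a.x, b.x);
    p.y += a.x * b.y;
    p.y += a.y * b.x;
    return quickTwoSum(p.x, p.y);
}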

From this, we can conclude that this is a correct implementation of the algorithm described in this paper, upheld by some testing I've done to validate that performing these operations in this manner yields results that appear to be correct.

Xirema

Xirema's answer provides a faithful rendering of Thall's high-radix long-hand division algorithm into C++. Based on fairly extensive testing against a higher-precision reference, I found its maximum relative error to be on the order of 2^-45, provided there are no underflows in intermediate computation.

On platforms that provide a fused multiply-add operation (FMA), the following Newton-Raphson-based division algorithm due to Nagai et al. is likely to be more efficient; it achieves identical accuracy in my testing, that is, a maximum relative error of 2^-45.

/*
  T. Nagai, H. Yoshida, H. Kuroda, Y. Kanada, "Fast Quadruple Precision 
  Arithmetic Library on Parallel Computer SR11000/J2." In: Proceedings 
  of the 8th International Conference on Computational Science, ICCS '08, 
  Part I, pp. 446-455.
*/
float2 div_df64 (float2 a, float2 b)
{
    float2 t, c;
    float r, s;
    r = 1.0f / b.x;              // approximate reciprocal of divisor's high part
    t.x = a.x * r;               // initial quotient approximation
    s = fmaf (-b.x, t.x, a.x);   // residual a.x - b.x * t.x
    t.x = fmaf (r, s, t.x);      // refine high-order quotient term
    t.y = fmaf (-b.x, t.x, a.x); // residual with respect to refined quotient
    t.y = a.y + t.y;             // fold in low part of dividend
    t.y = fmaf (-b.y, t.x, t.y); // fold in low part of divisor
    s = r * t.y;                 // approximate low-order quotient term
    t.y = fmaf (-b.x, s, t.y);   // residual of the low-order term
    t.y = fmaf (r, t.y, s);      // refine low-order quotient term
    c.x = t.x + t.y;             // renormalize result into a double-float
    c.y = (t.x - c.x) + t.y;     //   (fast two-sum of the two quotient terms)
    return c;
}
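
As a sanity check, a harness along the following lines can compare div_df64 against native double division. This is only an illustrative sketch: double_to_df64, df64_to_double, the operand ranges, and the error bound are assumptions made for testing purposes (and it assumes a float2 type like the one sketched in the question), not part of the Nagai et al. paper:

#include <cmath>
#include <cstdio>
#include <cstdlib>

// Split a double into a double-float pair (hypothetical test helper).
float2 double_to_df64 (double d) {
    float hi = (float)d;
    float lo = (float)(d - (double)hi);
    return float2(hi, lo);
}

// Recombine a double-float pair into a double.
double df64_to_double (float2 v) {
    return (double)v.x + (double)v.y;
}

int main (void) {
    srand (42);
    double max_rel_err = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        double a = (double)rand() / RAND_MAX + 0.5;  // operands in [0.5, 1.5]
        double b = (double)rand() / RAND_MAX + 0.5;  //   to avoid zero divisors
        double ref = a / b;
        double got = df64_to_double (div_df64 (double_to_df64 (a), double_to_df64 (b)));
        double rel = fabs (got - ref) / fabs (ref);
        if (rel > max_rel_err) max_rel_err = rel;
    }
    // Expect something on the order of 2^-44 .. 2^-45 for well-scaled operands;
    // converting the inputs to double-float already drops a few low-order bits.
    printf ("max relative error: %g (2^-45 = %g)\n", max_rel_err, ldexp (1.0, -45));
    return 0;
}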
njuffa
  • So my implementation already uses `fma` as part of the `twoProd` implementation (the paper advised against it, but it's my understanding that that was a consequence of some undesirable optimization behaviors seen on their target platform, and not a failure of the algorithm implementation), which is itself also folded into the `df64_mult` function as a result. So it's hard to gauge whether this actually constitutes a more efficient implementation than what I'm using, but I like having this as a reference. I might try to run performance testing with this as a comparison. – Xirema Feb 27 '20 at 21:42
  • @Xirema When done correctly (and Thall does do it correctly, as I recall, as opposed to many other sources), a single full `df64_add()` or `df64_sub()` requires something like 20 operations, i.e. more than the division algorithm by Nagai et al. In your code there is truncation on either input or output, but the combined operation count of the `df64_add` and `df64_sub` is what is costly, even if the multiplies are implemented via FMA. You could always time the throughput on your platform. Early GPUs did *not* implement a proper FMA, thus probably the caveat in Thall's paper. – njuffa Feb 27 '20 at 22:31