
Given two IEEE-754 double-precision floating-point numbers a and b, I want to get the exact quotient a/b rounded to an integer towards zero.
A C99 program to do that could look like this:

#include <fenv.h>
#include <math.h>
#pragma STDC FENV_ACCESS ON

double trunc_div(double a, double b) {
  int old_mode = fegetround();
  fesetround(FE_TOWARDZERO);
  double result = a/b;  // rounding occurs here
  fesetround(old_mode);
  return trunc(result);
}

#include <stdio.h>
int main() {
  // should print "6004799503160662" because 18014398509481988 / 3 = 6004799503160662.666...
  printf("%.17g\n", trunc_div(18014398509481988.0, 3.0));
}

Now suppose I only have access to the nearest-ties-to-even rounding mode: I could be using GCC with optimizations, compiling for a microcontroller, or having to make it work in JavaScript.

What I've tried is to compute a/b with the provided rounding, truncate, and compensate if the magnitude of the result is too large:

double trunc_div(double a, double b) {
  double result = trunc(a/b);
  double prod = result * b;
  
  if (a > 0) {
    if (prod > a || (prod == a && mul_error(result, b) > 0)) {
      result = trunc(nextafter(result, 0.0));
    }
  }
  else {
    if (prod < a || (prod == a && mul_error(result, b) < 0)) {
      result = trunc(nextafter(result, 0.0));
    }
  }

  return result;
}

The helper function mul_error computes the exact multiplication error (using Veltkamp-Dekker splitting):

// Return the 26 most significant bits of a.
// Assume fabs(a) < 1e300 so that the multiplication doesn't overflow.
double highbits(double a) {
  double p = 0x8000001L * a;  // splitter 2^27 + 1
  double q = a - p;
  return p + q;
}

// Compute the exact error of a * b.
double mul_error(double a, double b) {
  if (!isfinite(a*b)) return -a*b;
  int a_exp, b_exp;
  a = frexp(a, &a_exp);
  b = frexp(b, &b_exp);
  double ah = highbits(a), al = a - ah;
  double bh = highbits(b), bl = b - bh;
  double p = a*b;
  double e = ah*bh - p;  // The following multiplications are exact.
  e += ah*bl;
  e += al*bh;
  e += al*bl;
  return ldexp(e, a_exp + b_exp);
}

Can the compensation fail for some inputs (for example, due to overflow or underflow)?
Is there a faster way?


Edit: Changed the first line of mul_error from … return a*b to … return -a*b;. This fixes the cases where a = ±∞; finite inputs were OK.
Thanks to Eric Postpischil for catching the error.


Edit: If a, b are finite and non-zero and the division a/b overflows, I'd like to match IEEE-754 division in round-to-zero mode, which returns the maximum finite double-precision number ±(2¹⁰²⁴ − 2⁹⁷¹).


Edit: The functions frexp and ldexp can be called only when necessary.
That's a 30% speedup on doubles a, b with uniformly random bits.

double mul_error(double a, double b) {
  if (!isfinite(a*b)) return -a*b;
  double A = fabs(a), B = fabs(b);
  // bounds from http://proval.lri.fr/gallery/Dekker.en.html
  if (A>0x1p995 || B>0x1p995 || (A*B!=0 && (A*B<0x1p-969 || A*B>0x1p1021))) {
    // ... can overflow/underflow: use frexp, ldexp
  } else {
    // ... no need for frexp, ldexp
  }
}

Maybe ldexp is always unnecessary because we only need to know how mul_error compares to 0.
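That idea can be sketched as follows (mul_error_sign is a hypothetical name; it assumes a*b neither overflows nor lands in the subnormal range, which is the same precondition the fast-path bounds above check). Since frexp scaling is exact and multiplying by a power of two never changes a sign, the unscaled e already compares against 0 the same way the exact error does, and unlike ldexp(e, ...) it cannot be flushed to zero by underflow:

```c
#include <math.h>

// 26 high-order bits of a (Veltkamp splitting), as in the question.
static double highbits(double a) {
  double p = 0x8000001L * a;  // splitter 2^27 + 1
  double q = a - p;
  return p + q;
}

// Returns a value with the same sign as the exact error of a*b,
// skipping the final ldexp.
double mul_error_sign(double a, double b) {
  if (!isfinite(a*b)) return -a*b;
  int a_exp, b_exp;
  a = frexp(a, &a_exp);
  b = frexp(b, &b_exp);
  double ah = highbits(a), al = a - ah;
  double bh = highbits(b), bl = b - bh;
  double p = a*b;
  double e = ah*bh - p;  // the following multiplications are exact
  e += ah*bl;
  e += al*bh;
  e += al*bl;
  return e;  // scaled by 2^-(a_exp+b_exp), but the sign is unchanged
}
```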


Edit: Here's how to do it if you have 128-bit integers available. (It's slower than the original version.)

#include <math.h>
#include <stdint.h>

double trunc_div(double a, double b) {
  typedef uint64_t u64;
  typedef unsigned __int128 u128;  // GCC/Clang extension

  if (!isfinite(a) || !isfinite(b) || a==0 || b==0) return a/b;

  int sign = signbit(a)==signbit(b) ? +1 : -1;
  int ea; u64 ua = frexp(fabs(a), &ea) * 0x20000000000000;
  int eb; u64 ub = frexp(fabs(b), &eb) * 0x20000000000000;
  int scale = ea-53 - eb;
  u64 r = ((u128)ua << 53) / ub;  // integer division truncates
  if (r & 0xFFE0000000000000) { r >>= 1; scale++; }  // normalize
  
  // Scale<0 means that we have fractional bits. Shift them out.
  double d = scale<-63 ? 0 : scale<0 ? r>>-scale : ldexp(r, scale);
  
  // Return the maximum finite double on overflow.
  return sign * (isfinite(d) ? d : 0x1.fffffffffffffp1023); 
}
Řrřola
  • If the truncated result can't be expressed as `double` (e.g. `2^100 / 3`), do you want to round that down (towards zero) or to the nearest even? – chtz Dec 07 '21 at 12:08
  • 1
    `mul_error` can be replaced by `double mul_error(double a, double b) { return fma(a, b, -a*b); }`. – Eric Postpischil Dec 07 '21 at 12:12
  • 1
    I am not sure you have the sign sense you want in `mul_error`. If `a*b` is finite, positive, and greater than `a`•`b`, it returns a negative value. But if `a*b` is +∞, it returns a positive value (+∞). (This would affect whether the `fma` show above is equivalent for the purposes used in this question.) – Eric Postpischil Dec 07 '21 at 12:36
  • @chtz: I want to round that toward zero. – Řrřola Dec 07 '21 at 12:41
  • @Eric Postpischil: Yes, fma is definitely the way to go if the platform supports it. I'll also go and correct mul_error for overflowing inputs. – Řrřola Dec 07 '21 at 12:45
  • @chtz - example: 18014398509481988.0 / 3.0 rounded to nearest-even is 6004799503160663 even before the trunc(), but I need 6004799503160662. – Řrřola Dec 07 '21 at 12:51
  • @chux - Reinstate Monica: `trunc((a - b/2) / b)` fails for example on (a=1, b=1): trunc(1.0/1.0) = 1, but trunc((1.0 - 1.0/2) / 1.0) = trunc(0.5) = 0. Other positive examples: (a=100, b=3) or (a=2, b=0.5). – Řrřola Dec 07 '21 at 18:45
  • @Řrřola Oh well. Idea removed. – chux - Reinstate Monica Dec 07 '21 at 18:59
  • The CPython interpreter uses something like `rint((a - fmod(a,b)) / b)`, but even that has bad cases – for example (a=10¹⁷, b=9) or (a=2⁶⁴, b=77). See `_float_div_mod()` in `floatobject.c`. – Řrřola Dec 07 '21 at 19:17
  • I suspect there is not much alternative to doing a floating-point division and then adjusting, as the division operation is the only feasible way to get a correctly rounded quotient. If we took the significands and did an integer divide, there are issues in getting the desired number of bits in the quotient. With 128-bit integer arithmetic available (or at least somewhere around 106; I would have to analyze that), we could divide a left-adjusted numerator by a right-adjusted denominator and get a quotient with at least 53 bits, which could then be truncated to an integer (according to exponent scaling). – Eric Postpischil Dec 08 '21 at 20:00
  • So, if we are resolved to doing a floating-point division and then adjusting as needed, what remains is to check the particular implementation for bugs. – Eric Postpischil Dec 08 '21 at 20:02
  • @Řrřola How do you wish to handle overflow? For example (first failing test vector in my test scaffolding): `a= 0x1.000000fffffffp+216 b=-0x1.0000000000800p-934`. Your current code returns `result=-0x1.fffffffffffffp+1023`. But clearly that is not even close to the mathematical result. I would argue that if the result is not representable as a `double` because it is too large, one would want to return +/-INF to indicate this. – njuffa Dec 09 '21 at 06:21
  • @njuffa: In those cases, I'd like to match the behavior of IEEE-754 division in round-to-zero mode: even (0x1.fffffffffffffp+1023 / 0x1p-1074) rounds back to 0x1.fffffffffffffp+1023 even though the result is almost 2^2098. – Řrřola Dec 09 '21 at 10:23
  • 1
    @Řrřola In that case, your code works fine as is. It has passed 20+B test vectors of my test app so far. Obviously testing does not constitute proof. – njuffa Dec 09 '21 at 12:00

1 Answer


Consider the exact remainder r = remainder(a, b) (the IEEE-754 remainder, sometimes called frem).

We know that a = b*n + r for some integer n, with r between -b/2 and b/2.

And a/b = n + r/b with r/b between -1/2 and 1/2 (/ is exact division here).

We can imagine 2 cases where float(a/b) rounds up past the integer part n:

  • when the remainder is negative (opposite sign of n), and so small that float(n+r/b)=n
  • when n itself is too big to be represented as floating point

An example of the 1st case is

double a = ldexp(1.0, 53);             // 2^53, the successor of 2^53-1
double b = nextafter(6361.0, 7000.0);  // close to exact division because 2^53-1 = 6361*69431*20394401
double r = remainder(a, b);            // -0.287...

In this case, n=1416003655831 and float(a/b) rounds up to n, the residue -r/b being smaller than ulp(n).

Note that testing a > 0 && fma(result,b,-a) > 0 is OK, but adjusting with nextafter(result,0.0) is not in this case: it would lead to a non-integer result 1416003655830.999755859375. We should rather take result-1 when trunc(a/b) < 2^53.

For an example of the 2nd case, take:

double a = ldexp(1.0, 54);       // 2^54
double b = nextafter(1.0, 0.0);  // 1 - 2^-53
double r = remainder(a, b);      // 2.22...e-16

We have n = 2^54 + 2, the exact midpoint between a and nextafter(a, 2*a).
With a positive remainder r, float(a/b) is rounded up to a+4 (and trunc leaves it unchanged).
And the sign discussion from the 1st case does not apply here, so it can't be generalized...

Note that the second case can always be reduced to the first by appropriate scaling (at the start of trunc_div):

int exp,scale;
double result=a/b;
frexp(result,&exp);
scale=53-exp;
if(scale<0)
    return ldexp( trunc_div(ldexp(a,scale),b) , -scale );

But this is of little practical interest: the first case still requires adjusting the result when it rounds up.

So, the adjustment can fail to return an integer, as we saw in the 1st example, and this answer doesn't show a faster way; there's probably not much to gain.

aka.nice
  • Good case analysis. I'm now convinced that any correct solution either needs to find out whether float(_a_ / _b_) was rounded away from zero, or do the whole division in integer arithmetic. – Řrřola Dec 09 '21 at 13:48