2

I want to obtain the floating-point floor of the division of two positive floating-point numbers. In particular I'm after the greatest floating-point number not greater than the exact value of the floor of the division. The dividend can be big and the divisor small, but in my application there's no risk of overflow or underflow in the division.

If I do this:

quotient = floor(dividend / divisor);

I have the problem that, when the quotient is greater than the precision of the mantissa, the result of the division is always an integer, so the FPU rounds it rather than flooring it because it's in round-to-nearest-or-even mode; also floor() does nothing because it's fed an integer already. Since it's rounded, sometimes the result is greater than the exact floor, and that's not what I'm after.

Changing the FPU's rounding mode during the division would be a solution, but that is not an option, so barring that, how can I obtain the correct floor?

(Related: How to correctly floor the floating point pair sum)

Paul R
  • 208,748
  • 37
  • 389
  • 560
Pedro Gimeno
  • 2,837
  • 1
  • 25
  • 33
  • 2
    Have you already tried a correction step, such as `e=fma(-75.0, b, a); if (e < 0.0) b = nextafter (b, 0.0);` ? I am not saying this particular correction step always works, just looking for clarification what you have tried. – njuffa Aug 10 '22 at 23:19
  • There is no solution besides changing the rounding mode. You're dealing with numbers that are only approximations. In Python, you have the choice of doing your computations as integers (237261451793987450000000000000), which will produce an exact answer, but that may be making assertions about precision that you don't have. – Tim Roberts Aug 10 '22 at 23:19
  • To find the remainder, in Python use `%`: `2.3726145179398745e+29 % 75`. In C, use `fmod`: `fmod(2.3726145179398745e+29, 75)`. Assuming the IEEE-754 binary64 floating-point format is used, both of these produce 58, which is the correct remainder of 237261451793987452973306871808 modulo 75, and 237261451793987452973306871808 is the result of converting “2.3726145179398745e+29” to the binary64 format. For positive operands, a correctly implemented remainder has no rounding error. For any operands, the C correctly implemented `fmod` has no rounding error. – Eric Postpischil Aug 10 '22 at 23:20
  • (The Python `%` can have rounding error when the operands have different signs, as it may require returning a result greater than the first operand, putting it in a different exponent interval of the floating-point format.) – Eric Postpischil Aug 10 '22 at 23:22
  • 1
    Do you really want the floor or just the remainder? What do you want to do when the floor is not representable? floor(237261451793987452973306871808 / 75) is 3163486023919832706310758290, but that is not representable in binary64. The nearest representable value is 3163486023919832955533393920, and the nearest representable value below the floor is 3163486023919832405777580032. So, if you really want the `floor`, it is not possible without using extended precision arithmetic. – Eric Postpischil Aug 10 '22 at 23:23
  • I agree the problem in its present form is somewhat underspecified, and the asker should clarify the requirements. The exact remainder returned by a general-purpose `fmod()` function may not actually be required for the use case, and `fmod` tends to be slow when the divisor has much smaller magnitude than the dividend (looping implementation based on exponent difference). On the other hand, the application may only require a small number of divisors all known at compile time, which could allow some interesting optimization (I have not reasoned through the details). – njuffa Aug 10 '22 at 23:40
  • @njuffa No I haven't tried it yet. Is there a possible reason for that not to work? In my use case, the divisor is always an integer between 10 and 100. I wanted a general solution, but it would be interesting to have a variant with an optimization. And to all posters, I need both quotient and remainder, not just remainder, and I've updated the post to clarify what I mean by "correct". – Pedro Gimeno Aug 11 '22 at 01:25
  • @PedroGimeno See Eric Postpischil's most recent comment. The actual floor of the quotient might not be representable. What should happen in that case? If you use the kind of correction step I brought up, that should deliver the nearest representable value less or equal to the floor (I haven't actually shown that or tested it; I may be missing something). But the remainder computed using that will not necessarily represent the true mathematical remainder. If dividends and divisors are always integers, maybe look into using 128-bit integer division (the range may not be sufficient)? – njuffa Aug 11 '22 at 01:58
  • @njuffa I don't care about the precision of the remainder, as long as it is not negative. That's why the question specifically mentions the floor of the division and does not mention remainders. – Pedro Gimeno Aug 11 '22 at 20:35
  • Apologies for the irrelevant details that I introduced in my original formulation. I've heavily reworded the question to ask the same thing without the irrelevant details. I've also restored the original title which uses the same search terms I used when looking for a solution, to hopefully help others find an answer if one is posted. – Pedro Gimeno Aug 11 '22 at 21:47

1 Answers1

0

I ended up doing the division using integers. The functions below are only suitable for IEC-559 floats or doubles:

#include <stdint.h>
#include <math.h>

#ifdef __GNUC__
#define int_fast128 __int128
// other compilers pending
#endif

double truncdiv(double a, double b)
{
  int ae, be, re, sh, sh2;
  int_fast64_t am, bm;
  int_fast64_t rm;
  am = 9007199254740992. * frexp(a, &ae);
  bm = 9007199254740992. * frexp(b, &be);
  sh = 52 + (am < bm);  // add 1 if quotient is 1 bit short
  re = ae - be - sh;
  // Truncate the mantissa when the exponent is in range -52..0
  sh2 = re >= 0 ? 0 : -re;
  rm = re < -52 ? 0 : (((int_fast128)am << sh) / bm) >> sh2 << sh2;
  return ldexp(rm, re);
}

Note that this function is not written to handle signed zero, NaNs, infinities, denormals, overflow or division by zero. It is also truncation division rather than floor division, i.e. it rounds towards zero, not towards minus infinity. It requires a 128-bit integer type, which may not be available in all platforms. For single precision it would require only a 64-bit integer type, which is more widely supported:

#include <stdint.h>
#include <math.h>

float truncdivf(float a, float b)
{
  int ae, be, re, sh, sh2;
  int_fast32_t am, bm;
  int_fast32_t rm;
  am = 16777216.f * frexpf(a, &ae);
  bm = 16777216.f * frexpf(b, &be);
  sh = 23 + (am < bm);  // add 1 if quotient is 1 bit short
  re = ae - be - sh;
  // Truncate the mantissa when the exponent is in range -23..0
  sh2 = re >= 0 ? 0 : -re;
  rm = re < -23 ? 0 : (((int_fast64_t)am << sh) / bm) >> sh2 << sh2;
  return ldexpf(rm, re);
}
Pedro Gimeno
  • 2,837
  • 1
  • 25
  • 33