
Let’s assume you have a float64_t number with an arbitrary value and you want to find out if said number can safely be down-cast to a float32_t with the restriction that the resulting rounding error must not exceed a given epsilon.

A possible implementation could look like this:

float64_t before = 1.234567890123456789;
float64_t epsilon = 0.000000001;
float32_t mid = (float32_t)before;       // 1.2345678806304931640625
float64_t after = (float64_t)mid;        // 1.2345678806304931640625
float64_t error = fabs(before - after);  // 0.000000009492963526369635474111
bool success = error <= epsilon;         // false

To make things more interesting though, let’s assume you’re not supposed to perform any actual type casting of the value at hand between those two types (as seen above).

And just to take it up a notch: Let’s assume you’re not casting down to float32_t, but to a floating-point type of arbitrary precision (8bit, 16bit, 32bit, or maybe even 24bit), specified by its bit count and exponent length (and following the conventions of IEEE 754, such as rounding ties to even).

So what I’m looking for is a generic algorithm more akin to this:

float64_t value = 1.234567890123456789;
float64_t epsilon = 0.000000001;
int bits = 16;
int exponent = 5;
bool success = here_be_dragons(value, epsilon, bits, exponent); // false

To give an example: down-casting the 64bit number 1.234567890123456789 to lower precisions results in the following rounding errors:

 8bit: 0.015432109876543309567864525889
16bit: 0.000192890123456690432135474111
24bit: 0.000005474134355809567864525889
32bit: 0.000000009492963526369635474111
40bit: 0.000000000179737780214850317861
48bit: 0.000000000001476818667356383230
56bit: 0.000000000000001110223024625157

What’s known:

  1. The specifications for both precision types in question (one of lower precision than the other):
    • total length (in bits) (would be 32 for float, e.g.)
    • exponent length (in bits) (would be 8 for float, e.g.)
  2. The min and max values of each type (as those can be derived from the above).
  3. The number of positive normal values (excluding zero) (((2^exponent) - 2) * (2^mantissa))
  4. The exponent’s bias ((2^(exponent - 1)) - 1)
  5. The actual value (provided in the given higher precision type).
  6. The error threshold epsilon within which the down-cast’s error must fall in order to be considered successful (also provided in the given higher precision type).

(An approximation of the expected error might be enough, depending on its accuracy and deviation factors. Exact calculations preferred though, obviously.)
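For completeness, here is a small sketch (in C; the struct and function names are my own) of how the quantities from points 3 and 4 above derive from the two bit counts:

#include <stdint.h>

// Derived quantities for a format given by (total bits, exponent bits).
// Sketch only; assumes 1 sign bit and an implicit leading mantissa bit (IEEE 754).
typedef struct {
    int mantissa_bits;     // explicit mantissa bits: bits - exponent - 1
    int bias;              // (2^(exponent - 1)) - 1
    uint64_t normal_count; // ((2^exponent) - 2) * (2^mantissa)
} format_info;

format_info describe_format(int bits, int exponent) {
    format_info f;
    f.mantissa_bits = bits - exponent - 1;
    f.bias = (1 << (exponent - 1)) - 1;
    f.normal_count = ((UINT64_C(1) << exponent) - 2) << f.mantissa_bits;
    return f;
}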

Cases that need not be covered (as they are trivially solvable in isolation):

  • If the input value is non-normal (subnormal, infinity, NaN, zero, …), then the answer shall hereby be defined to be true.
  • If the input value falls outside the known boundaries (+- the given epsilon) of a given type of lower precision, then the answer shall hereby be defined to be false.

What I’ve thought up so far:

We know the count of positive normal values (excluding zero) in a given floating-point type and we know that the negative value space is symmetric to the positive one.

We also know that the distribution of discrete values within the value range (away from zero) follows an exponential function, and its relative epsilon a related step function.

It should be possible to calculate onto which nth discrete normal value a given real value within the normal value range of a given floating-point type would fall (via some kind of logarithmic projection, or something?), shouldn’t it? Given this n, one should then be able to calculate the epsilon of the corresponding value from its step function and compare it against the specified max error, no?

I have the feeling that this might actually be enough to calculate (or at least accurately estimate) the expected casting error. I just have no idea how to put these things together.
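To make that a bit more concrete, this is roughly the kind of sketch I have in mind (C, using frexp/ldexp for the exponent work; it only yields the worst-case half-ulp bound of the target type, not the exact error, and assumes a normal, in-range value as per the rules above):

#include <math.h>
#include <stdbool.h>

// Worst-case rounding error when storing `value` in a format with
// `mantissa_bits` explicit mantissa bits: half an ulp at that magnitude.
bool fits_within_epsilon(double value, double epsilon, int mantissa_bits) {
    int exp;
    frexp(fabs(value), &exp);                            // value = f * 2^exp, f in [0.5, 1)
    double ulp = ldexp(1.0, (exp - 1) - mantissa_bits);  // target spacing at this magnitude
    return 0.5 * ulp <= epsilon;                         // round-to-nearest error <= ulp/2
}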

How would you approach this? (Bonus points for actual code :P)

PS: To give a bit more context: I’m working on a var_float implementation, and in order to figure out the smallest losslessly (or lossy within a given epsilon) convertible representation for a given value, I’m currently performing a binary search utilizing the aforementioned naive round-trip logic to find the right size (roughly sketched below). It works, but it’s lacking in the efficiency and coolness department. Even though it is by no means a performance bottleneck (yada yada premature optimization yada yada), I'm curious as to whether one can find a more math-grounded and elegant solution. ;)
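Roughly sketched, that search currently looks like this (C; round_trip_error is a placeholder for a generalized version of the naive round-trip logic from the top of the question):

// Placeholder: generalized round-trip error for a candidate size in bits.
double round_trip_error(double value, int bits);

// Smallest size (in bits) whose round-trip error stays within epsilon.
// Sketch of my current approach; assumes the error shrinks monotonically
// as the candidate sizes grow, and that the range check happens separately.
int smallest_size(double value, double epsilon) {
    int lo = 8, hi = 64;                     // candidate sizes, inclusive
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (round_trip_error(value, mid) <= epsilon)
            hi = mid;                        // fits: try something smaller
        else
            lo = mid + 1;                    // too lossy: needs more bits
    }
    return lo;
}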

Regexident
  • Can't you just look if an appropriate amount of the last bits of the 8byte double are zero? Also check the exponent. The range of smaller formats is smaller. – Lutz Lehmann Apr 15 '15 at 11:22
  • That would not work—I think—as "0.328125" can be losslessly cast to `float8_t` (`0 001 0101`) leaving no trailing zeros. The range issue is covered by my "need not be covered" rule. ;) – Regexident Apr 15 '15 at 11:53
  • And in longer formats the mantissa of `21/64=(1.0101)_2*2^(-2)` continues with zeros. Which is exactly the condition required to compress it into the shorter format. – Lutz Lehmann Apr 15 '15 at 12:02
  • Ah, I misread your comment. Thought you were specifically talking about the trailing bits in 8bit representation. Makes sense now (and kind of resembles DrKoch's answer, doesn't it?). – Regexident Apr 15 '15 at 12:04
  • Yes, that's why I do not add my own answer. However, this method only applies to truly lossless compression. If some limited error is allowed, then more effort is required. – Lutz Lehmann Apr 15 '15 at 12:06

2 Answers


Something like the following may work:

#include <math.h>

int isbad(double x, double releps) {
  double y = x * (1 + 0x1.0p29); // split constant: 1 + 2^29
  double z = y - x - y + x;      // the "little half" dropped by the split
  return !(fabs(z / x) < releps);
}

This uses a trick (due to Dekker, I believe) for splitting a floating-point number into a "big half" and a "little half" that sum exactly to the original number. I want the "big half" to have 23 bits and the "little half" to have the rest, so I split using the constant 1 + 2^(52-23).

Caveats: You need to handle the more limited exponent range by checking against upper and lower bounds. Subnormals (especially the case where the result is subnormal in the little type but not the big type) need different special handling. I wrote !(fabs(z/x) < releps) instead of fabs(z/x) <= releps because I want NaNs to qualify as "bad." releps is a bad name for that variable, since the threshold is actually half an ulp larger than the number you specify when using round-to-nearest.
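If it helps with the arbitrary-precision part of the question: the split constant can presumably be parameterized the same way, 1 + 2^(52 - m) for a target with m explicit mantissa bits (sketch only, function names are mine; the exponent-range and subnormal caveats above still apply, and m should stay well below 52):

#include <math.h>

// Error incurred when rounding x to a significand of (m + 1) bits,
// via the same split with constant 1 + 2^(52 - m). Sketch only;
// assumes x is normal and in range in both formats.
double roundoff_error(double x, int m) {
  double y = x * (1 + ldexp(1.0, 52 - m));
  double z = y - x - y + x;   // the "little half" dropped by the split
  return fabs(z);
}

int isbad_generic(double x, double releps, int m) {
  return !(roundoff_error(x, m) / fabs(x) < releps);
}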

tmyklebu
  • Wow, that's it! Changing your function's last line to `return fabs(z);` makes it return the exact error. Holy crap. This was the kind of magic I was hoping for. – Regexident Apr 15 '15 at 14:17
  • This page has two references and the earlier one is by Veltkamp (1968): http://proval.lri.fr/gallery/Dekker.en.html – Pascal Cuoq Apr 15 '15 at 14:22
  • @PascalCuoq: Do you have an electronic copy of that Veltkamp paper, or know where to find one? – tmyklebu Apr 15 '15 at 14:49
  • @tmyklebu I will ask the author of the webpage when I see her. – Pascal Cuoq Apr 15 '15 at 14:52

Downcasting is equivalent to setting the least significant bits of the mantissa to zero.

So, for a given floating point number, just extract the least significant bits of the mantissa (width depending on the downcast-type) and scale with current exponent. This should (very precisely) be the "rounding error" which will occur in downcasting.


Edit

As mentioned in the comments, the above is only true for 50% of all cases (when downcasting leads to rounding down). In cases where downcasting leads to rounding up, a slightly modified approach helps:

(extreme/corner-case example: five digits of mantissa in the downcast type)

Rounding down: 0x1.00007fff -> 0x1.0000 
               -> Err == 0x0.00007fff

Rounding up:   0x1.00008000 -> 0x1.0001 -> Err == 0x1.00010000 - 0x1.00008000
               -> Err == 0x0.00008000
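A rough sketch (C; the helper name is mine) of that bit-extraction idea with round-to-nearest taken into account; it assumes a normal double and a target with m explicit mantissa bits, 0 < m < 52:

#include <math.h>
#include <stdint.h>
#include <string.h>

// Rounding error when keeping only the top m mantissa bits of a normal
// double: the value of the dropped bits (rounding down) or its complement
// (rounding up), whichever is smaller; an exact tie gives half either way.
double roundoff_error_bits(double x, int m) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);                        // raw IEEE 754 bits
    int drop = 52 - m;                               // mantissa bits discarded
    uint64_t low  = u & ((UINT64_C(1) << drop) - 1); // dropped bits
    uint64_t half = UINT64_C(1) << (drop - 1);       // exact halfway point
    uint64_t err  = (low <= half) ? low : (UINT64_C(1) << drop) - low;
    int exp;
    frexp(x, &exp);                                  // x = f * 2^exp, f in [0.5, 1)
    return ldexp((double)err, (exp - 1) - 52);       // mantissa units -> value
}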
DrKoch
  • “Downcasting is equivalent to setting the least significant bits of the mantissa to zero” is not quite true, because the conversion from double to float **rounds** to the nearest float on most compilation platforms. What you have described is what would happen if conversion **truncated**. – Pascal Cuoq Apr 15 '15 at 11:51
  • For instance the error when rounding the double `0x1.000001fffffffp0` to `float` is millions times less than your method would lead one to believe, as the aforementioned `double` is simply converted to `0x1.000002p0f`. – Pascal Cuoq Apr 15 '15 at 11:53
  • Good point, my `var_float` type does indeed round ties to even. Should have mentioned that specifically I guess. Will update my question. – Regexident Apr 15 '15 at 11:56
  • But if the last bits are truly zero, then downcasting is just truncation. And if they are not zero, then downcasting gives a different number, regardless of the rounding rule. – Lutz Lehmann Apr 15 '15 at 11:57