
Currently, IIRC, the approach for displaying floating point numbers is to show them as 1/2 + 1/4 + 1/8 .... However, what if we changed our approach to floating point numbers such that any floating point number is actually a normal integer, padded back by a series of 0's? Each number would have to be larger, similar to the 62bit double.

For the 62bit double, we have 11 bits reserved for the exponent and 53 bits for the actual number. Now, what we could instead do is have one number represent the amount of "zeros" the value is padded back by. In this example we could have 11 as the padding bits, that mean we have (2 ^ 11) - 1 digits of accuracy for a 53 bit number.
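
Here is a rough sketch in Python of what I have in mind; the helper names are just for illustration, and the value is simply an ordinary integer together with a count of how many places it is padded back by:

from fractions import Fraction

def encode(digits, padding):
    # Store the value digits * 10**-padding as a (padding, digits) pair,
    # e.g. 0.4 -> (1, 4): the integer 4 padded back by one decimal place.
    return (padding, digits)

def decode(padding, digits):
    # Recover the exact value as a rational number, with no rounding.
    return Fraction(digits, 10 ** padding)

print(decode(1, 4))    # 2/5, i.e. exactly 0.4
print(decode(2, 25))   # 1/4, i.e. exactly 0.25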

Suppose I want to display 0.4. Currently, in Python we know 0.4 has floating point issues, for example:

>>> import decimal
>>> decimal.Decimal(0.4)
Decimal('0.40000000000000002220446049250313080847263336181640625')

However, with my encoding this will not happen. Why? Because I can represent the number 4 with traditional binary 100, and the amount of zeros it is padded back by as the binary number 1 (01). This means I can represent the number 0.4 without any floating point issues by the number,

0 00000000001 00000000000000000000000000000000000000000000000000100

First bit reserved for sign, next 11 for zero padding and 53 for the number. It requires more bits, but I can now represent a number up to 2 ^ 11 digits of length with accuracy. Not only this, the Wikipedia page suggests the C++ double is only 16 digits accurate, which means mine is 2048 - 16 digits more accurate!
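
To make the bit layout concrete, here is a small decoder for the bit string above (this is only a sketch of my proposal, not an existing format):

def decode_bits(bits):
    # Split a "<1 sign bit> <11 padding bits> <53 integer bits>" string and
    # rebuild the decimal value as text, with no rounding anywhere.
    sign_bit, padding_bits, integer_bits = bits.split()
    padding = int(padding_bits, 2)           # number of decimal places, here 1
    digits = int(integer_bits, 2)            # the plain integer, here 4
    text = str(digits).rjust(padding + 1, "0")
    value = text[:-padding] + "." + text[-padding:] if padding else text
    return ("-" if sign_bit == "1" else "") + value

print(decode_bits("0 00000000001 00000000000000000000000000000000000000000000000000100"))
# prints 0.4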

J.Doe
  • Now try to take a square root, or even just divide by 3. Decimal isn't magic. It's just decimal. – user2357112 Oct 14 '20 at 06:19
    "I can now represent a number up to 2 ^ 11 digits of length with accuracy" <- Either this doesn't follow, or I'm misunderstanding you. To take something with just 19 significant digits: could you explain how you'd represent the number `1.234567890123456789` in your proposed format? – Mark Dickinson Oct 14 '20 at 06:56
  • Re “approach for displaying floating point numbers”: Numbers are represented by a floating-point format, not displayed. – Eric Postpischil Oct 14 '20 at 11:11
  • Re “normal integer, padded back by a series of 0's”: The exponent scales the number; it does not pad it. And the various floating-point formats may already be interpreted as an integer scaled by an exponent, so it is not clear what change you propose. Perhaps you mean to scale by a power of ten instead of a power of 2. There are already such decimal-based formats. They do not eliminate errors. – Eric Postpischil Oct 14 '20 at 11:12
  • Re “62bit double”: 64 bits are used for `double`. – Eric Postpischil Oct 14 '20 at 11:13
  • Re “For the 62bit double, we have 11 bits reserved for the exponent and 53 bits for the actual number”: 11 plus 53 is 64, not 62. And the IEEE-754 format used for `double` has 1 sign bit, 11 exponent bits, and 52 bits for the primary significand field, with another bit of the significand derived from the exponent. – Eric Postpischil Oct 14 '20 at 11:14
  • Re “In this example we could have 11 as the padding bits, that mean we have (2 ^ 11) - 1 digits of accuracy for a 53 bit number”: Using 11 bits for a decimal exponent gives a range of 2048 in the exponent of 10; e.g., the scale could range from 10^-1023 to 10^+1024, although we might want to reserve one or two values for infinities, NaNs, and subnormals. But it does not change the accuracy or precision. That is determined by the 53 bits of the significand. For a decimal format, those 53 bits will provide slightly less accuracy than binary, due to some inefficiency. – Eric Postpischil Oct 14 '20 at 11:18
  • Re “First bit reserved for sign, next 11 for zero padding and 53 for the number”: One plus 11 plus 53 is 65. – Eric Postpischil Oct 14 '20 at 11:18
  • Re “C++ double is only 16 digits accurate, which means mine is 2048 - 16 digits more accurate!”: Using the exponent field for scaling by a power of 10 will not give 2048 digits in the numbers. The number of decimal digits in each number will be limited by the 53-bit significand. – Eric Postpischil Oct 14 '20 at 11:20

2 Answers


It's odd that you specifically mention IEEE 754-1985, because IEEE 754-2008 already introduces decimal arithmetic. Your proposed scheme has a much smaller range compared to double, which makes it unsuitable for scientific calculation. Indeed, decimal calculations are frequently reserved for financial applications, because even in everyday life we rarely deal with absolute precision. We can have exactly 3 cows in the field, but their weight? Their price might seem absolutely precise, but what about after you calculate the sales tax you owe?

IEEE 754-2008 introduces decimal64, where the maximum number of significant digits is still 16. Even in science's domain (where decimal arithmetic isn't appropriate), NASA's interplanetary flights rely on a humble 3.141592653589793 for pi, cut off at the 15th decimal place. Oh, but you want financial calculation? Well, .NET uses a 128-bit decimal which gives 28-29 digits of precision, and financial institutions around the world happily adopt .NET decimals without bothering with other fancy schemes. decimal128 exists and has 34 digits of accuracy.
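
Incidentally, Python's standard decimal module defaults to the same ballpark of precision out of the box, so you don't even need a fancy scheme to try this:

from decimal import Decimal, getcontext

print(getcontext().prec)         # 28 significant digits by default, comparable to .NET's decimal
print(Decimal(1) / Decimal(7))   # 0.1428571428571428571428571429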

Also, your scheme can't possibly have 2048 - 16 digits of accuracy. You only assign 53 bits for the number while .NET Decimal assigns 96 bits, and since your scheme is pretty much similar

The binary representation of a Decimal value consists of a 1-bit sign, a 96-bit integer number, and a scaling factor used to divide the 96-bit integer and specify what portion of it is a decimal fraction. The scaling factor is implicitly the number 10, raised to an exponent ranging from 0 to 28.

the accuracy would've been at the low end, between decimal64's 16 digits (utilizing 50 bits) and .NET Decimal's 28 digits. In practice, normal users don't do billions of financial calculations daily, so consumer CPUs don't bother to adopt IEEE 754-2008, and since the only people asking for it buy IBM's POWER CPUs to stick in their servers, don't expect native hardware and integrated (as in standard, not an extra library) language support any time soon.
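
A quick back-of-the-envelope check of that 53-bit limit (plain arithmetic, nothing standard-specific):

import math

print(2 ** 53 - 1)                      # 9007199254740991 -- the largest 53-bit integer
print(math.floor(53 * math.log10(2)))   # 15 -- decimal digits a 53-bit integer can always carry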

Martheen

Arbitrary precision software libraries do exist. The disadvantage of using them is speed. Even then you will never be able to represent numbers that have an infinite number of recurring digits.
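
For example, taking Python's standard decimal module as the arbitrary-precision library, you can ask for as many digits as you like and a recurring value still has to be cut off somewhere:

from decimal import Decimal, getcontext

getcontext().prec = 100       # ask for 100 significant digits
third = Decimal(1) / Decimal(3)
print(third)                  # 0.333... (exactly 100 threes, then it stops)
print(third * 3 == 1)         # False -- the truncation never goes away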

You can also define your own fixed point encoding using integral types. As already said, you will be trading off precision for range.
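
A minimal sketch of such a fixed-point encoding, assuming non-negative values and exactly three decimal places (the helper names are made up for illustration):

SCALE = 1000   # three fixed decimal places, stored in a plain Python int

def to_fixed(text):
    # Parse a decimal string like "0.4" into a scaled integer (here 400).
    whole, _, frac = text.partition(".")
    return int(whole) * SCALE + int(frac.ljust(3, "0")[:3])

def to_text(value):
    return f"{value // SCALE}.{value % SCALE:03d}"

print(to_text(to_fixed("0.4") + to_fixed("0.2")))   # 0.600 -- exact, but only 3 fractional digits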

in Python we know 0.4 has floating point issues

I am not aware that Python, or any other language, has issues with 0.4. Everything is perfectly well defined and deterministic.
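
As a quick check with the standard library, the float spelled 0.4 denotes one exact rational value, the same every time:

from decimal import Decimal
from fractions import Fraction

print(Fraction(0.4))   # 3602879701896397/9007199254740992 -- the denominator is 2**53
print(Decimal(0.4))    # 0.40000000000000002220446049250313080847263336181640625, the same value in decimal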

Paul Floyd