Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

Question

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).

Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.

But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:

Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?

Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.

Pascal Cuoq · Accepted Answer · 2014-04-17T07:47:37.257

10

I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.

By contrast, naively borrowing any number of bits at all from the significand introduces many double-rounding problems, all the more frequent that you are rounding from a 52-bit significand to a significand with only a few bits removed. Borrowing one bit from the significand as you suggest in an edit to your question would be the worst, with half the operations statistically producing double-rounded results that are different from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5ULP, the basic operations would be accurate to ³/₄ULP, a dramatic loss of accuracy that would derail many of the existing, finely-designed numerical algorithms that expect 0.5ULP.

Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).

Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:

if ((repr & 7) == 4) 
  repr += (repr & 8) >> 1);   /* midpoint case */
else
  repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */

Despite working on the integer having the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.

edited Apr 17 '14 at 07:47

answered Apr 16 '14 at 14:43

Pascal Cuoq

79,187
7
161
281

Thank you for the answer. I thought about using 32-bit floats. I'm weighing my options, and there are 3 possibilities considered right now: a) a truncated 64-bit format stored inline, b) a 32-bit IEEE float, stored inline, c) a "boxed" 64/80-bit float - with a pointer to the actual value stored. If a) is going to be messy, I will have to choose one of the others. Now, off to read the article! :) – Jakub Wasilewski Apr 16 '14 at 14:48
1

@JakubWasilewski You can eliminate the double-rounding problem for basic operations (+, -, *, /, sqrt) by making sure the first result is computed with a significand at least twice as wide as the destination significand of the second rounding (“When is double rounding innocuous?”, Figueroa). If you can use the 80-bit format for actual computations, this allows you to invent your own format with up to 32-bit of significand. If you want to use double-precision computations, you really need to try to stick to a 53-bit significand. – Pascal Cuoq Apr 16 '14 at 14:53
Well, if the best I can do with 80-bit calculations is "a little better than a 32-bit float", I would probably prefer just using 32-bit floats directly - more performance, less hassle, almost as good precision. – Jakub Wasilewski Apr 16 '14 at 14:57
Just adding an opinion: not to put much effort on intermediate results. – Kahler Apr 16 '14 at 14:59
@JakubWasilewski This discussion of 80-bit FPU functionality reminds me that the width of the significand of the 387 can actually be set programmatically (in order to emulate almost exactly single- and double-precision). If the instruction that does this allows to choose any width (as opposed to only 24, 53 or 64), you could pick the exact width that works best for your 61-bit format. – Pascal Cuoq Apr 16 '14 at 15:04
@JakubWasilewski That is not possible according to this link: http://msdn.microsoft.com/en-us/library/e9b52ceh.aspx . You could make a super-large range format with 24 bits of significand and 15 bits of exponent, though. – Pascal Cuoq Apr 16 '14 at 15:10
1

After some research of my own, it seems double-rounding problems could be avoided by [using round-to-odd for the intermediate result](http://www.exploringbinary.com/gcc-avoids-double-rounding-errors-with-round-to-odd/) ([another link](https://www.lri.fr/~melquion/doc/05-imacs17_1-expose.pdf)). If I understand correctly, this would allow for a well-behaved 11-bit exponent + 49-bit mantissa float. I'm not sure yet how hard/portable it will be to force the intermediate calculations to be done with round-to-odd. – Jakub Wasilewski Apr 16 '14 at 18:17
It seems it would be possible to "implement" round-to-odd by using [`fesetround(FE_TOWARD_ZERO)`](http://linux.die.net/man/3/fesetround) and using the [`FE_INEXACT`](http://www.cplusplus.com/reference/cfenv/FE_INEXACT/) flag for adding the 'to-odd' part. – Jakub Wasilewski Apr 16 '14 at 18:30
@JakubWasilewski You still need to strip at least two significand bits in order for the round-to-odd trick to work. If you choose this path, you would naturally borrow the three bits from the significand. – Pascal Cuoq Apr 16 '14 at 19:48
Taking 3 bits from the exponent would result in a type with the range of 32-bit float but the precision of 64-bit. That seems very useful - many of the situations that need 64 bit do so because of precision, not range. – Patricia Shanahan Apr 17 '14 at 00:59
This is exactly what the 64bits Visualworks Smalltalk virtual machine is doing, so it must be a good answer. – aka.nice Apr 18 '14 at 17:57
@aka.nice Which solution, the exponent one, the single-precision one or the borrow-at-least-two-bits-from-significand-and-use-round-to-odd one? Also I love how IEEE 754 and common implementation techniques for high-level languages give a common vocabulary for people with different backgrounds to understand each other. – Pascal Cuoq Apr 18 '14 at 18:04
@PascalCuoq VIsualworks have a SmallDouble from which they still 3 bits from exponent. Then they isolate sign, shift the result <<3 and tag it with a bit pattern & 0x4 to distinguish from every other object which are 8 bytes aligned pointer, and from tagged SmallIntegers(& 0x1) and tagged Characters (& 0x2) This way, no loss of precision, behaves like a normal Double. – aka.nice Apr 18 '14 at 18:13
1

@PascalCuoq but of course, another solution would be to tag with &0x2 shift <<2 and steal only 2 bits from exponent, and tag characters with 0x4 because encoding ain't gonna require 62 bits – aka.nice Apr 18 '14 at 18:16

score 0 · Answer 2 · edited May 23 '17 at 10:26

This is a summary of my own research and information from the excellent answer by @Pascal Cuoq.

There are two places where we can truncate the 3-bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.

Truncating the mantissa

We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.

Portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARD_ZERO), and then setting the last bit if the FE_INEXACT exception occurs. After computing the final double this way, we simply round to nearest for storage.

The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).

Truncating the exponent

We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).

The resulting format has significantly reduced range since we borrow a whopping 3 out of 11 bits of the exponent. The range goes down from 10^-308...10³⁰⁸ to about 10^-38 to 10³⁸. Seems OK for computation, but we still lose a lot.

Comparison

Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:

the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
there is a proof of mathematical correctness of this approach
giving up one significant digit seems less impactful than giving up a big chunk of the double's range

Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).

It is not “Occam”, it is “OCaml”. And the 63-bit approach is not used there, it is the subject of a tongue-in-cheek proposal written by someone with a blog and no direct relation to the developers (me). — Pascal Cuoq, Apr 17 '14 at 11:35
Fixed and clarified that its a hypothetical approach. I left this answer here as a sort of executive summary of all the information we gathered on the topic and the trade-offs. We'll see if it gets voted up as useful. — Jakub Wasilewski, Apr 17 '14 at 11:40

Truncating 64-bit IEEE doubles to 61-bits in a safe fashion

2 Answers2

Truncating the mantissa

Truncating the exponent

Comparison