
I have a decent understanding of how floating point works, but I want to know how the specific exponent and mantissa sizes were decided upon. Are they optimal in some way? How can optimality be measured for floating point representations (I assume there are several ways)? I imagine these issues are addressed in the official standard, but I don't have access to it.
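For reference, the field widths being asked about are fixed by IEEE 754: binary32 (single) uses 1 sign bit, 8 exponent bits, and 23 stored fraction bits (exponent bias 127); binary64 (double) uses 1 sign bit, 11 exponent bits, and 52 stored fraction bits (bias 1023). A minimal Python sketch that pulls these fields apart (the `fields` helper is just for illustration):

```python
import struct

def fields(x, fmt, exp_bits, frac_bits):
    """Split an IEEE-754 value into its (sign, biased exponent, fraction) bit fields."""
    raw = struct.pack(fmt, x)             # big-endian bytes of the encoding
    bits = int.from_bytes(raw, "big")     # the encoding as one unsigned integer
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction

# binary32: 1 sign bit + 8 exponent bits + 23 fraction bits (bias 127)
print(fields(1.0, ">f", 8, 23))    # (0, 127, 0)
# binary64: 1 sign bit + 11 exponent bits + 52 fraction bits (bias 1023)
print(fields(1.0, ">d", 11, 52))   # (0, 1023, 0)
```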

Stephen Canon
monguin
  • Does this answer your question? [How are IEEE-754 single and double precision formats determined?](https://stackoverflow.com/questions/23064893/how-are-ieee-754-single-and-double-precision-formats-determined) – phuclv Sep 30 '20 at 12:56

2 Answers


According to this interview with Will Kahan, they were based on the VAX F and G formats of the era.

Of course that doesn't answer the question of how those formats were chosen...

Simon Byrne
  • While I have never seen a published rationale for the VAX floating-point formats, I always imagined that the exponent range of the F format was chosen so as to allow the representation of all important physical constants, including the Planck constant (6.626070040 x 10**-34) and the Avogadro constant (6.022140857 x 10**23). Pure conjecture, of course. – njuffa Dec 15 '15 at 17:24
  • An internet search led me to this rationale for the VAX's F and D floating-point formats as originally designed for the PDP-11: [PDP-11/40 Technical Memorandum #16](https://ia601604.us.archive.org/26/items/bitsavers_decpdp11meoatingPointFormat_1047674/701110_The_PDP-11_Floating_Point_Format.pdf). The discussion of the exponent range of the F format specifically mentions the Planck and Avogadro constants. – njuffa Dec 15 '15 at 19:15
  • In [NA Digest Sunday, February 16, 1992 Volume 92 : Issue 7](http://www.netlib.org/na-digest-html/92/v92n07.html), James Demmel relates issues with the VAX's D format due to its narrow exponent range with respect to LAPACK, but it is not clear from the discussion how these kinds of issues specifically led to the choice of 11 exponent bits in the VAX's G format. – njuffa Dec 15 '15 at 19:26
  • [D. Stevenson, A Proposed Standard for Binary Floating-Point Arithmetic](http://www.computer.org/csdl/mags/co/1981/03/01667284.pdf) explains the choice of exponent bits for the double-precision format as follows: "The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format -- a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer." – njuffa Dec 15 '15 at 19:40
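Stevenson's overflow rationale in the last comment can be checked with exact integer arithmetic: the largest finite binary32 value is (2 - 2**-23) * 2**127, just below 2**128, so a product of eight such terms stays strictly below 2**(8*128) = 2**1024, which is exactly the binary64 overflow threshold. A quick sanity check in Python (whose integers are arbitrary-precision, so nothing overflows here):

```python
# Largest finite binary32 value, written as an exact integer:
# (2 - 2**-23) * 2**127 == (2**24 - 1) * 2**104, just below 2**128.
max_binary32 = (2**24 - 1) * 2**104

# binary64 overflows at 2**1024; its largest finite value sits just below.
binary64_overflow = 2**1024

# A product of eight maximal binary32 terms still fits in binary64:
print(max_binary32**8 < binary64_overflow)   # True
```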

For 32-bit IEEE floats, the reasoning is that the precision should be at least as good as that of a 24-bit fixed-point format.

Why exactly 24 bits, I don't know, but it seems like a reasonable tradeoff.
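One way to read the 24-bit claim: 23 stored fraction bits plus the implicit leading 1 give 24 significant bits, so binary32 represents every integer up to 2**24 exactly, and 2**24 + 1 is the first integer it cannot. A small Python check, round-tripping through binary32 (Python's own floats are binary64):

```python
import struct

def as_f32(x):
    """Round a Python float (binary64) to binary32 and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(as_f32(2.0**24 - 1) == 2**24 - 1)   # True: exactly representable
print(as_f32(2.0**24) == 2**24)           # True: powers of two are exact
print(as_f32(2.0**24 + 1) == 2**24 + 1)   # False: rounds to 2**24
```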

I suppose having a nice "round" number like that (mantissa + sign = 3 bytes, exponent = 1 byte) can also make implementations more efficient.

Johan Kotlinski
  • Splitting things into bytes helps enormously with implementations. Splitting things as 8+56 or 16+48 would also have helped with implementation, but an 8-bit exponent would be a bit on the small side, and a 16-bit exponent would represent a waste of bits. – supercat May 07 '14 at 19:45