How are IEEE-754 single and double precision formats determined?

Question

I'm interested in how these are determined:

Single precision: 8 bits for the exponent e and the rest (23 bits) for the mantissa
Double precision: 11 bits for e and the rest (52 bits) for the mantissa. Of course, there is also 1 bit for the sign.

So how is it determined how many bits go to the mantissa and how many go to e? I guess this is a noob question, but I would like to know the answer.

Because it says here: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=4610933 — user1937198, Apr 14 '14 at 16:09
It is arbitrary. One day a group of engineers got together and decided on how best to represent floating point numbers. They came up with this format and hardware and software vendors went with it. — jliv902, Apr 14 '14 at 16:12
Excellent question actually. There are quite natural reasons for the totals of 32 and 64 bits, because even 20 years ago those were what we expected to be "natural" sizes for many years to come. For the distribution (8 + 23 or 11 + 52), the alternatives would have been, for example, 7+24 or 9+22. You win some precision and lose some range, or the other way round. Someone just had to decide what the optimal point is. — gnasher729, Apr 14 '14 at 16:20
@jliv902 For a much more accurate description of how things happened, see http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html . Unfortunately that essay does not touch on the trade-offs in the attribution of bits to different formats, but these were of course also carefully weighed. — Pascal Cuoq, Apr 14 '14 at 16:22
[Why did IEEE 754 choose to allocate 23 bits to the mantissa and not 22 or 24 (etc.)?](https://stackoverflow.com/q/51777010/995714), [What is the rationale for exponent and mantissa sizes in IEEE floating point standards?](https://stackoverflow.com/q/4397081/995714) — phuclv, Sep 21 '18 at 16:36
Possible duplicate of [What is the rationale for exponent and mantissa sizes in IEEE floating point standards?](https://stackoverflow.com/questions/4397081/what-is-the-rationale-for-exponent-and-mantissa-sizes-in-ieee-floating-point-sta) — phuclv, Sep 21 '18 at 16:36

phuclv · Answer 1 · 2020-09-30T04:23:23.847

If you design a format of your own, you can decide how many bits go to the exponent and how many to the mantissa, depending on whether you need more precision or a larger range. Since IEEE-754 is designed for general use, its authors had to choose what is better in most situations.

Before IEEE-754 there were many floating-point formats with different pros and cons, some of them from DEC. Initially DEC created the 32-bit F and 64-bit D formats for their VAX system. Both have 8 bits for the exponent, enough to represent all important physical constants, including the Planck constant (6.626070040 × 10⁻³⁴) and the Avogadro constant (6.022140857 × 10²³). But they quickly realized that this range is quite limited and that overflow/underflow happens every now and then, so they added 3 more bits to the exponent to create a new 64-bit G format. When Dr. Kahan wrote the IEEE-754 draft he "suggested that DEC VAX's floating-point be copied because it was very good for its time", and that's why IEEE-754 single and double precision have 8 and 11 bits in the exponent part respectively.

Another rationale for the 64-bit format is to allow repeated multiplication without overflow:

For the 64-bit format, the main consideration was range; as a minimum, the desire was that the product of any two 32-bit numbers should not overflow the 64-bit format. The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format — a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer.
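The "product of eight 32-bit terms" claim can be checked numerically. The following is a minimal sketch, assuming Python floats are IEEE-754 doubles (true on common platforms); the helper name `product_of_copies` is made up for this illustration. The largest single-precision value is just under 2^128, so eight of them multiply to just under 2^1024, which is where double precision overflows.

```python
import math
import struct

# Largest finite IEEE-754 single-precision value, decoded from its bit pattern
# (sign 0, exponent 0xFE, all 23 fraction bits set).
FLT_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

def product_of_copies(x, n):
    """Multiply n copies of x together using double-precision arithmetic."""
    result = 1.0
    for _ in range(n):
        result *= x  # float multiplication overflows to inf, it does not raise
    return result

eight = product_of_copies(FLT_MAX, 8)
nine = product_of_copies(FLT_MAX, 9)

print("8 terms finite:", math.isfinite(eight))
print("9 terms overflow:", math.isinf(nine))
```

A repeated multiplication is used rather than `FLT_MAX ** 9` because Python's float power operator raises `OverflowError` instead of returning infinity.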

"A Proposed Standard for Binary Floating-Point Arithmetic", David Stephenson, IEEE Computer, Vol. 14, No. 3, March 1981, pp. 51-62

It's the same reason that various DSPs have a wider accumulator register, usually 40 bits, which allows adding 256 32-bit values without overflow.
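The accumulator arithmetic can be sketched with plain integers. This is an illustration only, assuming signed two's-complement 32-bit inputs and a signed 40-bit accumulator; the names below are hypothetical and do not model any particular DSP.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

ACC_BITS = 40  # assumed accumulator width: 32 data bits + 8 guard bits
ACC_MAX = 2**(ACC_BITS - 1) - 1
ACC_MIN = -2**(ACC_BITS - 1)

def fits_in_accumulator(value):
    """Return True if value is representable in the signed 40-bit accumulator."""
    return ACC_MIN <= value <= ACC_MAX

# Worst cases: 256 additions of the most extreme 32-bit values.
worst_positive = 256 * INT32_MAX
worst_negative = 256 * INT32_MIN

print(fits_in_accumulator(worst_positive))    # 256 terms still fit
print(fits_in_accumulator(worst_negative))    # 256 terms still fit
print(fits_in_accumulator(257 * INT32_MIN))   # one more term can overflow
```

The 8 extra guard bits give a factor of 2^8 = 256 of headroom, just as extra exponent bits give headroom for repeated multiplication.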

In fact, the current rule for IEEE-754 interchange formats is that a k-bit format has round(4 log₂(k)) − 13 exponent bits. The standard states this rule for widths of 128 bits and up, and it also matches the 64-bit format. So every time we double the width of the type, the exponent gets 4 more bits, which allows multiplying 16 values of the narrower type without overflow.
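The formula is easy to evaluate for a few widths. A small sketch follows; the function names are chosen for this example. Note that 32 bits gives 7, not the 8 bits single precision actually uses, so the formula does not describe the historical 32-bit choice.

```python
import math

def exponent_bits(k):
    """Exponent field width for a k-bit interchange format: round(4*log2(k)) - 13."""
    return round(4 * math.log2(k)) - 13

def precision_bits(k):
    """Significand precision, counting the implicit leading bit.

    The stored fields are 1 sign bit + exponent bits + (precision - 1) fraction bits,
    so precision = k - exponent_bits(k).
    """
    return k - exponent_bits(k)

for k in (32, 64, 128, 256):
    print(k, exponent_bits(k), precision_bits(k))
```

For k = 64, 128 and 256 this gives 11, 15 and 19 exponent bits, each step adding 4.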

In the 16-bit half-float format, using only 4 bits for the exponent would make the range too narrow: the maximum value would be much smaller than even the maximum 16-bit int value. So they use 5 bits instead. Half-floats are mainly used in computer graphics, so the 11 bits of precision are probably enough, and they need a bigger exponent for a wider dynamic range.
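To see how much range a 4-bit exponent would cost, one can compute the largest finite value of a binary format from its field widths. This is a sketch assuming the usual IEEE-754 conventions (biased exponent, all-ones exponent reserved for infinity and NaN, implicit leading 1); the 4-bit variant is hypothetical and `max_finite` is a name made up here.

```python
def max_finite(total_bits, exp_bits):
    """Largest finite value of an IEEE-754-style binary format.

    total_bits = 1 sign bit + exp_bits + fraction bits.
    """
    frac_bits = total_bits - 1 - exp_bits
    emax = 2**(exp_bits - 1) - 1          # largest unbiased exponent
    return (2 - 2.0**-frac_bits) * 2.0**emax

INT16_MAX = 2**15 - 1

print(max_finite(16, 5))  # actual half-float layout
print(max_finite(16, 4))  # hypothetical layout with a 4-bit exponent
print(INT16_MAX)
```

With 5 exponent bits the maximum is 65504; with 4 it would fall below 256, far short of the 16-bit int maximum of 32767.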

For more details, read "Where did the free parameters of IEEE 754 come from?"
