
How many bits out of 64 are assigned to the integer part and the fractional part in a double? Or is there a rule that specifies this?

  • [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf). See also [this answer](http://programmers.stackexchange.com/questions/215065/can-anyone-explain-representation-of-float-in-memory/215126#215126). – John Bode May 14 '15 at 13:10
  • Floating point does not have an integer and fractional part as such. It is like scientific notation. The normal numbers in the commonest double format have an 11-bit binary exponent, modifying a significand of the form 1.x, where x is 52 bits. – Patricia Shanahan May 14 '15 at 14:18

2 Answers


Note: I know I already replied with a comment. This is for my own benefit as much as the OP's; I always learn something new when I try to explain it.

Floating-point values (regardless of precision) are represented as follows:

    sign * significand * β^exp

where sign is 1 or -1, β is the base, exp is an integer exponent, and significand is a fraction. In this case, β is 2. For example, the real value 3.0 can be represented as 1.10_2 * 2^1, or 0.11_2 * 2^2, or even 0.011_2 * 2^3.

Remember that a binary number is a sum of powers of 2, with powers decreasing from the left. For example, 101_2 is equivalent to 1 * 2^2 + 0 * 2^1 + 1 * 2^0, which gives us the value 5. You can extend that past the radix point by using negative powers of 2, so 101.11_2 is equivalent to

1 * 2^2 + 0 * 2^1 + 1 * 2^0 + 1 * 2^-1 + 1 * 2^-2

which gives us the decimal value 5.75. A floating-point number is normalized such that there's a single non-zero digit prior to the radix point, so instead of writing 5.75 as 101.11_2, we'd write it as 1.0111_2 * 2^2.

How is this encoded in a 32-bit or 64-bit binary format? The exact format depends on the platform; most modern platforms use the IEEE-754 specification (which also specifies the algorithms for floating-point arithmetic, as well as special values such as infinity and Not a Number (NaN)); however, some older platforms may use their own proprietary format (such as the VAX G and H extended-precision floats). x86 also has an 80-bit extended-precision format used for intermediate calculations.

The general layout looks something like the following:

seeeeeeee...ffffffff....

where s represents the sign bit, e represents bits devoted to the exponent, and f represents bits devoted to the significand or fraction. The IEEE-754 32-bit single-precision layout is

seeeeeeeefffffffffffffffffffffff

This gives us an 8-bit exponent (which can represent the values -126 through 127) and a 23-bit significand (giving us roughly 6 to 7 significant decimal digits). A 0 in the sign bit represents a positive value, 1 represents negative. The exponent is encoded with a bias of 127, such that 00000001_2 represents -126, 01111111_2 represents 0, and 11111110_2 represents 127 (00000000_2 is reserved for representing 0 and "denormalized" numbers, while 11111111_2 is reserved for representing infinity and NaN). This format also assumes a hidden leading significand bit that's always set to 1, giving 24 bits of effective precision. Thus, our value 5.75, which we represent as 1.0111_2 * 2^2, would be encoded in a 32-bit single-precision float as

01000000101110000000000000000000
||      ||                     |
||      |+----------+----------+
||      |           |
|+--+---+           +------------ significand (1.0111, hidden leading bit)
|   |
|   +---------------------------- exponent (2)
+-------------------------------- sign (0, positive)

The IEEE-754 double-precision float uses 11 bits for the exponent (-1022 through 1023) and 52 bits for the significand. I'm not going to bother writing that out (this post is turning into a novel as it is).

Floating-point numbers have a greater range than integers because of the exponent; the exponent 127 only takes 8 bits to encode, but 2^127 represents a 39-digit decimal number. The more bits in the exponent, the greater the range of values that can be represented. The precision (the number of significant digits) is determined by the number of bits in the significand. The more bits in the significand, the more significant digits you can represent.

Most real values cannot be represented exactly as a floating-point number; you cannot squeeze an infinite number of values into a finite number of bits. Thus, there are gaps between representable floating point values, and most values will be approximations. To illustrate the problem, let's look at an 8-bit "quarter-precision" format:

seeeefff

This gives us an exponent between -7 and 8 (we're not going to worry about special values like infinity and NaN) and a 3-bit significand with a hidden leading bit. The larger our exponent gets, the wider the gap between representable values gets. Here's a table showing the issue. The left column is the significand; each additional column shows the values we can represent for the given exponent:

sig    -1        0        1        2        3        4        5
---    ----      -----    -----    -----    -----    -----    ----
000    0.5       1        2        4         8       16       32
001    0.5625    1.125    2.25     4.5       9       18       36
010    0.625     1.25     2.5      5        10       20       40
011    0.6875    1.375    2.75     5.5      11       22       44
100    0.75      1.5      3        6        12       24       48
101    0.8125    1.625    3.25     6.5      13       26       52
110    0.875     1.75     3.5      7        14       28       56
111    0.9375    1.875    3.75     7.5      15       30       60

Note that as we move towards larger values, the gap between representable values gets larger. We can represent 8 values between 0.5 and 1.0, with a gap of 0.0625 between each. We can represent 8 values between 1.0 and 2.0, with a gap of 0.125 between each. We can represent 8 values between 2.0 and 4.0, with a gap of 0.25 in between each. And so on. Note that we can represent all the positive integers up to 16, but we cannot represent the value 17 in this format; we simply don't have enough bits in the significand to do so. If we add the values 8 and 9 in this format, we'll get 16 as a result, which is a rounding error. If that result is used in any other computation, that rounding error will be compounded.

Note that some values cannot be represented exactly no matter how many bits you have in the significand. Just like 1/3 gives us the non-terminating decimal fraction 0.333333..., 1/10 gives us the non-terminating binary fraction 0.000110011001100..._2 (1.10011001100..._2 * 2^-4 in normalized form). We would need an infinite number of bits in the significand to represent that value exactly.

John Bode

A double on a 64-bit machine has one sign bit, 11 exponent bits, and 52 fraction bits.

Think: sign * 1.fraction * 2^exponent, where the 52 fraction bits form the significand (with a hidden leading 1) and the 11 exponent bits hold a biased exponent.

Andrew Malta
  • I have been going through this [link](http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double). I'm still not able to understand why the max value of a double is 1.7E308, when taking 53 bits for the integer part only amounts to 2^53. How are these two numbers related? – Austin Philip D Silva May 14 '15 at 12:35