
Suppose we have a real number a which has infinite precision. Now, we have the floating-point types double and float in C/C++ and want to represent a using one of those types. Let's say a_f is the name of the variable holding that representation.

I already understand how the values are represented: they consist of three parts - sign, fraction, and exponent. Depending on which type is used, the number of bits assigned to the fraction and the exponent differs, and that determines the "precision".

How is the precision defined in this sense?

Is that the upper bound of the absolute difference between a and a_f (|a - a_f|), or is it something else?

In the case of double, why is the "precision" bounded by 2^{-54}?

Thank you.

user9414424
  • The precision is determined by the number of bits in the mantissa. Nothing to do with the exponent. For IEEE754 `double` the answer is 53 bits, because that's the way it is defined. – user207421 Jul 25 '19 at 10:07
  • What is the "precision"? – user9414424 Jul 25 '19 at 10:08
  • The standard only places a minimum restriction. The actual precision is implementation defined. – L. F. Jul 25 '19 at 10:08
  • @L.F. Your source for that assertion? – user207421 Jul 25 '19 at 10:12
  • @user9414424 The 'precision' is what your question is about. If you don't know what it is I don't understand what you're really asking. – user207421 Jul 25 '19 at 10:13
  • The precision is the number of bits assigned to the fraction. The number of bits assigned to the exponent has nothing to do with it. – john Jul 25 '19 at 10:15
  • The precision is not bounded by 2^(-54). That seems like you've read something but not understood it. – john Jul 25 '19 at 10:16
  • @user207421 Hmm ... C11 [5.2.4.2.2](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#%5B%7B%22num%22%3A94%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C-27%2C816%2Cnull%5D) maybe. – L. F. Jul 25 '19 at 10:21
  • @L.F. My comment explicitly referred to IEEE 754, which is as I described it. You are now referring to the C-2011 draft standard. The question is about C++. – user207421 Jul 25 '19 at 10:24
  • @user207421 Uh, I thought you knew http://eel.is/c++draft/cfloat.syn#1 ... – L. F. Jul 25 '19 at 10:27
  • @L.F. When I am specifically talking about IEEE 754 I do not expect to be referred to *another* standard, without citation, as 'the' standard. Please don't add to the confusion. – user207421 Jul 25 '19 at 10:30
  • @user207421 My initial comment wasn't directed at you. :) – L. F. Jul 25 '19 at 10:32

3 Answers


The precision of floating point types is normally defined in terms of the number of digits in the mantissa, which can be obtained using std::numeric_limits<T>::digits (where T is the floating point type of interest - float, double, etc).

The number of digits in the mantissa is defined in terms of the radix, obtained using std::numeric_limits<T>::radix.

Both the number of digits and radix of floating point types are implementation defined. I'm not aware of any real-world implementation that supports a floating point radix other than 2 (but the C++ standard doesn't require that).

If the radix is 2, std::numeric_limits<T>::digits is the number of bits (i.e. base two digits), and that defines the precision of the floating point type. For IEEE754 double precision, that works out to 53 bits of precision (52 bits stored explicitly, plus one implicit leading bit) - but the C++ standard does not require an implementation to use IEEE floating point representations.
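
For example, here is a minimal sketch that queries these values (the numbers in the comment assume a typical IEEE754 platform; other implementations may differ):

#include <iostream>
#include <limits>

int main() {
    std::cout << "float:  radix " << std::numeric_limits<float>::radix
              << ", digits " << std::numeric_limits<float>::digits << '\n'
              << "double: radix " << std::numeric_limits<double>::radix
              << ", digits " << std::numeric_limits<double>::digits << '\n';
    // On a typical IEEE754 platform this prints radix 2,
    // 24 digits for float and 53 digits for double.
}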

When storing a real value a in a floating point variable, the value actually stored (what you're describing as a_f) is the nearest approximation that can be represented (assuming effects like overflow do not occur). The difference (or the magnitude of the difference) between the two does not depend only on the mantissa - it also depends on the floating point exponent - so there is no fixed upper bound.
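
For instance, even a value as simple as 0.1 has no exact binary representation; a tiny sketch (the printed value assumes IEEE754 binary64):

#include <cstdio>

int main() {
    // 0.1 is converted to the nearest representable double,
    // which is slightly above 0.1.
    double a_f = 0.1;
    std::printf("%.20f\n", a_f);  // 0.10000000000000000555 on IEEE754 binary64
}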

Practically (in very rough terms), the possible difference between a value and its floating point approximation is related to the magnitude of the value. Floating point variables do not represent a uniformly distributed set of values between the minimum and maximum representable values - this is a trade-off of representing numbers using a mantissa and an exponent, which is necessary to be able to represent a larger range of values than an integral type of the same size.
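
To see that non-uniform spacing directly, here is a short sketch that measures the gap between a double and the next representable double at several magnitudes (the gaps quoted below assume IEEE754 binary64):

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    // The distance to the next representable double grows
    // with the magnitude of the value.
    for (double x : {1.0, 1e8, 1e16}) {
        double next = std::nextafter(x, std::numeric_limits<double>::infinity());
        std::cout << "gap after " << x << " is " << (next - x) << '\n';
    }
}

On an IEEE754 platform this prints gaps of roughly 2.2e-16, 1.5e-8 and 2, respectively.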

Peter
  • Neither the C nor the C++ standard requires that representable value be the result of converting a numeral to a floating-point type. They only require that, if the number is within range, either the nearest higher or nearest lower representable value be chosen, in an implementation-defined manner. – Eric Postpischil Jul 25 '19 at 14:11
  • “Significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic. – Eric Postpischil Jul 25 '19 at 14:12
  • It is imprecise to refer to “storing” a real number in a floating-point value. This is better understood (and is defined in the IEEE-754 standard) as a conversion operation: A number represented as, say, a decimal numeral in source code or other string is converted from that decimal format to a floating-point format. Similarly to any arithmetic operation like multiplication, the real number result is rounded to a representable value. This model is more useful when analyzing and designing floating-point operations and writing proofs. Thinking of floating-point as “storing” a real number… – Eric Postpischil Jul 25 '19 at 14:15
  • … leads to a mistaken model of a floating-point number as representing a fuzzy range of real numbers, which then leads to incorrect deductions about how they behave. Each floating-point number represents one real number exactly. If a real number cannot be represented, it is not “stored” as a floating-point number; it is converted, with rounding. – Eric Postpischil Jul 25 '19 at 14:15
  • @EricPostpischil - the question was imprecise, and writing a precise/complete answer would take more time than I have to check the details and write one. IEEE-754 used (introduced?) the term "significand" to address the distinction you're highlighting, but older terms like mantissa are still commonly used - including in recent C++ standards - since IEEE-754 is not the only model of floating point. So, while I accept your observations, older terminology is still used, and I suspect not disappearing any time soon – Peter Jul 25 '19 at 14:52
  • @Peter: Assuming, for whatever reason, you want to choose your terminology based on popularity rather than precision and quality, the current C standard uses “significand” throughout and has no occurrence of “mantissa”, while the C++ standard has three uses of “significand” and one of “mantissa” (albeit a late draft; I do not have the official version). – Eric Postpischil Jul 25 '19 at 15:03
  • @PeteBecker IEEE-754 binary64 is 52 bits stored with 1 bit implied, so a total of 53, see also https://stackoverflow.com/questions/18409496/is-it-52-or-53-bits-of-floating-point-precision – Mark Rotteveel Aug 13 '19 at 18:53
  • @MarkRotteveel — thanks. Removed. – Pete Becker Aug 13 '19 at 18:58

The thing with floating point values is that they get less accurate (in absolute terms) the greater their magnitude. For example:

#include <iostream>

int main() {
    double x1 = 10;
    double x2 = 20;

    std::cout << std::boolalpha << (x1 == x2) << '\n';
}

prints, as expected, false.

However, the following code:

#include <iostream>
#include <limits>

int main() {
    // the greatest number representable as double
    double x1 = std::numeric_limits<double>::max();
    double x2 = x1 - 10;

    std::cout << std::boolalpha << (x1 == x2) << '\n';
}

prints, unexpectedly, true, since the numbers are so big that you can't meaningfully represent x1 - 10. It gets rounded to x1.

One may then ask where and what the bounds are. As we see the inconsistencies, we obviously need some tools to inspect them. <limits> and <cmath> are your friends.

std::nextafter:

std::nextafter takes two floats or two doubles. The first argument is our starting point and the second one represents the direction in which we want to compute the next representable value. For example, we can see that:

#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>

int main() {
    double x1 = 10;
    double x2 = std::nextafter(x1, std::numeric_limits<double>::max());

    std::cout << std::setprecision(std::numeric_limits<double>::digits) << x2 << '\n';
}

x2 is slightly more than 10. On the other hand:

#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>

int main() {
    double x1 = std::numeric_limits<double>::max();
    double x2 = std::nextafter(x1, std::numeric_limits<double>::lowest());

    std::cout << std::setprecision(std::numeric_limits<double>::digits)
              << x1 << '\n' << x2 << '\n';
}

Outputs on my machine:

1.79769313486231570814527423731704356798070567525845e+308
1.7976931348623155085612432838450624023434343715745934e+308
                 ^ difference

This is only the 16th decimal place. Considering that this number is multiplied by 10^308, you can see why subtracting 10 changed absolutely nothing.


It's tough to talk about specific values. One may estimate that doubles have about 15 significant decimal digits of precision (counting digits before and after the dot combined), and that's a decent estimation; however, if you want to be sure, use the convenient tools designed for this specific task.
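
<limits> also exposes those decimal estimates directly; a short sketch (the values in the comments assume IEEE754 double):

#include <iostream>
#include <limits>

int main() {
    // digits10: decimal digits guaranteed to survive a
    // text -> double -> text round trip (15 for IEEE754 double).
    // max_digits10: digits needed to print a double so that it
    // can be read back exactly (17 for IEEE754 double).
    std::cout << std::numeric_limits<double>::digits10 << '\n'
              << std::numeric_limits<double>::max_digits10 << '\n';
}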

Fureeish
  • In the first example, both numbers are perfectly accurate. Their size doesn't affect this. It's **subtracting** two numbers with widely different values that leads to problems. And in the real world, comparing numbers that are so widely different is usually meaningless, too. Real numbers are capable of representing differences like that, but the result isn't useful. – Pete Becker Jul 25 '19 at 12:26
  • @PeteBecker I dare to disagree with "*Real numbers are capable of representing differences like that, but the result isn't useful*". Computations that are used in physics do care about such subtle differences. But if you care about precision that much, you shouldn't be using `double` in the first place. – Fureeish Jul 25 '19 at 13:15
  • You're talking about differences on the order of one part in 10^300, which is far beyond the precision of any instrument today. Differences of that magnitude are far below the noise level of any measurement. – Pete Becker Jul 25 '19 at 15:22
  • @PeteBecker resurgence theory and asymptotic series expansions in physics require these kinds of values and differences, unrepresentable by doubles, but I believe this is not a place for such discussions. – Fureeish Jul 25 '19 at 18:42

For instance, the number 123456789 may be represented as .12 * 10^9, or maybe .12345 * 10^9, or .1234567 * 10^9. None of these is an exact representation, and some are better than others. Which one you end up with depends on how many bits you have for the fraction. More bits means more precision. The number of bits used to represent the fraction is called the "precision".
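
The same effect is easy to observe in C++; a small sketch (the rounded value in the comment assumes IEEE754 float):

#include <iostream>

int main() {
    // 123456789 needs 27 significant bits. float's 24-bit fraction
    // cannot hold it, so it is rounded to the nearest representable
    // value; double's 53-bit fraction represents it exactly.
    float  f = 123456789.0f;
    double d = 123456789.0;

    std::cout.precision(10);
    std::cout << f << '\n'   // 123456792 on an IEEE754 platform
              << d << '\n';  // 123456789
}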

Aykhan Hagverdili