
I have seen long articles explaining how floating point numbers can be stored and how arithmetic on those numbers is done, but please briefly explain why, when I write

cout << 1.0 / 3.0 << endl;

I see 0.333333, but when I write

cout << 1.0 / 3.0 + 1.0 / 3.0 + 1.0 / 3.0 << endl;

I see 1.

How does the computer do this? Please explain just this simple example. It is enough for me.

Narek
  • It sounds like great news, doesn't it? – Luc M May 17 '11 at 15:29
  • @close voters: I don't see anything in this question that would be off-topic or overly broad. If you do, please explain yourselves in comments. – Marc Mutz - mmutz May 17 '11 at 16:33
  • I am also surprised at what is making these four close-voters angry :)! – Narek May 17 '11 at 17:03
  • By the way, almost every answer got it wrong. Most answers assumed that the FPU gets the bits wrong but that the output conversion rounds to the right value. Not true at all. In both decimal and binary, the rational-but-repeating fraction 1/3 cannot be exactly represented, but when added twice it does round to exactly the right answer. Where many answers went wrong: this rounding happens in the last operation, the second add, and it happens in the least significant (2**-23 or 2**-53) bit. It results in an exactly precise 1.0 (0x3f800000) no matter how the output conversion is done. – DigitalRoss May 19 '11 at 16:18

5 Answers


Check out the article "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

Prasoon Saurav
  • I have seen this article, with some theorems and proofs. It seems it is not too brief, hence I have asked this question! :) – Narek May 17 '11 at 15:32
  • @Narek: floating point arithmetic is *not* simple, and if you intend to do anything non-trivial, you definitely should read this. I suggest you ask specific questions about it, instead of asking a question that broad. – Alexandre C. May 17 '11 at 15:34
  • @Alexandre C, for now I do not need the whole theory. I just need to understand this simple example. – Narek May 17 '11 at 15:37
  • @Narek: you need it to understand this simple example. – Alexandre C. May 17 '11 at 16:07
  • Ok, then I need to postpone it indefinitely, until I have free time for this! :( – Narek May 17 '11 at 16:58
  • Computers are now powerful enough to represent real numbers as arrays of base-10 integers (like BigDecimal in Java); this would solve the precision problem, for instance for numbers like 0.1, which are used much more often than sums like 2^n + 2^(n-1) + .... Why is there no strong trend in this direction? – Nutel May 19 '11 at 19:05
  • @Nutel: Ask that as a new question instead. I'm intrigued. :) – Macke May 20 '11 at 08:42

The problem is that the floating point format represents fractions in base 2.

The first fraction bit is ½, the second ¼, and it goes on as 1/2ⁿ.

And the problem with that is that not every rational number (a number that can be expressed as the ratio of two integers) actually has a finite representation in this base 2 format.

(This makes the floating point format difficult to use for monetary values. Although these values are always rational numbers (n/100), only .00, .25, .50, and .75 actually have exact representations in any number of digits of a base-2 fraction.)

Anyway, when you add them back, the system eventually gets a chance to round the result to a number that it can represent exactly.

At some point, it finds itself adding the .666... number to the .333... one, like so:

  00111110 1  .o10101010 10101010 10101011
+ 00111111 0  .10101010 10101010 10101011o
------------------------------------------
  00111111 1 (1).0000000 00000000 0000000x  # the x isn't in the final result

The leftmost bit is the sign, the next eight are the exponent, and the remaining bits are the fraction. In between the exponent and the fraction is an assumed "1" that is always present, and therefore not actually stored, as the normalized leftmost fraction bit. Zeroes that aren't actually present as individual bits are written as o.

A lot has happened here: at each step, the FPU has taken rather heroic measures to round the result. Two extra digits of precision (beyond what will fit in the result) have been kept, and the FPU knows, in many cases, whether any (or at least one) of the remaining rightmost bits was a one. If so, that part of the fraction is more than 0.5 (scaled), and so it rounds up. The intermediate rounded values allow the FPU to carry the rightmost bit all the way over to the integer part and finally round to the correct answer.

This didn't happen because anyone added 0.5; the FPU just did the best it could within the limitations of the format. Floating point is not, actually, inaccurate. It's perfectly accurate, but most of the numbers we expect to see in our base-10, rational-number world-view are not representable by the base-2 fraction of the format. In fact, very few are.

DigitalRoss
  • I would like to emphasize that the 1.0 answer is most definitely *not* the result of finally rounding in the output conversion routine. All current FPU units will in fact produce an exact `1.0` in this case. – DigitalRoss May 19 '11 at 07:08

Let's do the math. For brevity, we assume that you only have four significant (base-2) digits.

Of course, since gcd(2,3)=1, 1/3 is periodic when represented in base-2. In particular, it cannot be represented exactly, so we need to content ourselves with the approximation

A := 1×1/4 + 0×1/8 + 1×1/16 + 1×1/32

which is closer to the real value of 1/3 than

A' := 1×1/4 + 0×1/8 + 1×1/16 + 0×1/32

So, printing A in decimal gives 0.34375 (the fact that you see 0.33333 in your example is just a testament to the larger number of significant digits in a double).

When adding these up three times, we get

A + A + A
= ( A + A ) + A
= ( (1/4 + 1/16 + 1/32) + (1/4 + 1/16 + 1/32) ) + (1/4 + 1/16 + 1/32)
= (   1/4 + 1/4 + 1/16 + 1/16 + 1/32 + 1/32   ) + (1/4 + 1/16 + 1/32)
= (      1/2    +     1/8         + 1/16      ) + (1/4 + 1/16 + 1/32)
=        1/2 + 1/4 +  1/8 + 1/16  + 1/16 + O(1/32)

The O(1/32) term cannot be represented in the result, so it's discarded and we get

A + A + A = 1/2 + 1/4 + 1/8 + 1/16 + 1/16 = 1

QED :)

Marc Mutz - mmutz
  • But the fact that the results are exactly 1.0 is just luck. They could easily have been 1 or 2 LSB off, and his program would still have displayed 1, since by default, the precision of a floating point conversion is only 6. – James Kanze May 17 '11 at 17:57
  • Yes, this answer is incorrect, but the reason for the correct result is not "luck". In fact the addition of all three fractions WILL produce an exact 1.0 as I have explained, no matter how many decimal places you go ahead and print. – DigitalRoss May 19 '11 at 07:05
  • @DigitalRoss: care to explain where "this answer is incorrect"? – Marc Mutz - mmutz May 19 '11 at 09:55
  • @UpAndAdam read again. The core point is that the O(1/32) part of the result cannot be represented in the FP type due to the limited number of significant base-2 digits, thus producing an exact 1.0. – Marc Mutz - mmutz Sep 10 '13 at 13:50

As for this specific example: I think the compilers are too clever nowadays, and automatically make sure that a constant result of primitive types is exact if possible. I haven't managed to fool g++ into getting an easy calculation like this wrong.

However, it's easy to bypass such things by using non-const variables. Still,

int d = 3;
float a = 1./d;     // 1/3, rounded to float precision
std::cout << d*a;   // prints 1

will yield exactly 1, although this shouldn't really be expected. The reason, as was already said, is that operator<< rounds the error away.

As to why it can do this: when you add numbers of similar size or multiply a float by an int, you get pretty much all the precision the float type can offer, which means the ratio error/result is very small (in other words, the error, if there is one, shows up only in a late decimal place).

So 3*(1./3), even though it is, as a float, not exactly ==1, has a large correct part which keeps operator<< from noticing the small error. However, if you then remove this large part by simply subtracting 1, the floating point value slips down right to the error, and suddenly it's not negligible at all any more. As I said, this doesn't happen if you just type 3*(1./3)-1, because the compiler is too clever, but try

int d = 3;
float a = 1./d;
std::cout << d*a << " - 1 = " <<  d*a - 1 << " ???\n";

What I get (g++, 32 bit Linux) is

1 - 1 = 2.98023e-08 ???
leftaroundabout
  • I get 0 with Linux 64bit and g++ :). But I am really interested in how to find the error??? – Narek May 18 '11 at 05:23
  • That's interesting. I didn't _expect_ it to work differently on 32 and 64 bit, but I was not sure... – leftaroundabout May 18 '11 at 17:19
  • Have you tried it with bigger prime numbers than 3? How about nested fractions? – leftaroundabout May 18 '11 at 17:24
  • You're using different versions of g++ and/or different optimization levels. Try making it `extern int d; extern float a` and compile a separate source file with just `int d = 3; float a = 1./d;` that you link into your executable. That should defeat any precision-munging optimization it's doing. – Chris Dodd May 18 '11 at 22:03

This works because the default precision is 6 digits, and rounded to 6 digits the result is 1. See 27.5.4.1 basic_ios constructors in the C++ draft standard (n3092).

starblue