
I'm using Delphi XE2 Update 3. There are precision issues with even the simplest of floating-point numbers (like 3.7). Given this code (a 32-bit console app):

program Project1;

{$APPTYPE CONSOLE}
{$R *.res}

uses System.SysUtils;

var s: Single; d: Double; x: Extended;
begin
  Write('Size of Single  -----  ');  Writeln(SizeOf(Single));
  Write('Size of Double  -----  ');  Writeln(SizeOf(Double));
  Write('Size of Extended  ---  ');  Writeln(SizeOf(Extended));  Writeln;

  s := 3.7;  d := 3.7;  x := 3.7;

  Write('"s" is ');                  Writeln(s);
  Write('"d" is ');                  Writeln(d);
  Write('"x" is ');                  Writeln(x);                 Writeln;

  Writeln('Single Comparison');
  Write('"s > 3.7"  is  ');          Writeln(s > 3.7);
  Write('"s = 3.7"  is  ');          Writeln(s = 3.7);
  Write('"s < 3.7"  is  ');          Writeln(s < 3.7);           Writeln;

  Writeln('Double Comparison');
  Write('"d > 3.7"  is  ');          Writeln(d > 3.7);
  Write('"d = 3.7"  is  ');          Writeln(d = 3.7);
  Write('"d < 3.7"  is  ');          Writeln(d < 3.7);           Writeln;

  Writeln('Extended Comparison');
  Write('"x > 3.7"  is  ');          Writeln(x > 3.7);
  Write('"x = 3.7"  is  ');          Writeln(x = 3.7);
  Write('"x < 3.7"  is  ');          Writeln(x < 3.7);           Readln;
end.

I get this output:

Size of Single  -----  4
Size of Double  -----  8
Size of Extended  ---  10

"s" is  3.70000004768372E+0000
"d" is  3.70000000000000E+0000
"x" is  3.70000000000000E+0000

Single Comparison
"s > 3.7"  is  TRUE
"s = 3.7"  is  FALSE
"s < 3.7"  is  FALSE

Double Comparison
"d > 3.7"  is  TRUE
"d = 3.7"  is  FALSE
"d < 3.7"  is  FALSE

Extended Comparison
"x > 3.7"  is  FALSE
"x = 3.7"  is  TRUE
"x < 3.7"  is  FALSE

You can see that Extended is the only type that evaluates correctly. I thought precision was only an issue with a complicated floating-point number like 3.14159265358979323846, not something as simple as 3.7. The issue with Single kind of makes sense, but why doesn't Double work?

James L.
  • This should give you a [`short answer`](http://edn.embarcadero.com/article/22507). Here's a human [`readable article`](http://rvelthuis.de/articles/articles-floats.html) and a [`detailed story`](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html). – TLama May 15 '14 at 00:14
  • 3.7 isn't simple. 3.75 is. Or 3.71875 f.i. if you want to be a little closer, or 3.69921875... – Sertac Akyuz May 15 '14 at 00:27
  • I can't believe this is not a duplicate. I did a quick search but didn't find anything. – Graymatter May 15 '14 at 00:51
  • All **great comments**, and thanks for the articles. – James L. May 15 '14 at 01:44
  • Please don't think that you can just use extended and suddenly all will be well – David Heffernan May 15 '14 at 04:25
  • @David - Agreed. Everyone's comments and answers have helped me understand a lot more about the nuances of floating-point numbers and operations. `Extended` is certainly not a magic solution. – James L. May 15 '14 at 05:53
  • In fact extended usually just leads to slower programs. – David Heffernan May 15 '14 at 05:58
  • @DavidHeffernan, sometimes the extended precision is needed, despite the fact that it leads to slower execution. There are many numerical methods that benefits from the extra precision. – LU RD May 15 '14 at 08:00
  • @LURD I'm not disputing that sometimes it is useful. Although I do think such cases are exceptionally rare. – David Heffernan May 15 '14 at 08:02
  • @DavidHeffernan, rare perhaps, but still the majority of my applications needs them. – LU RD May 15 '14 at 08:06
  • @LURD I'm curious as to why that would be so. When I ported to 64 bit, I had a couple of scenarios where I used extended. I analysed them and simply re-cast the algorithm so that it would work in double. What is it about your algos that require extended? – David Heffernan May 15 '14 at 08:08
  • @DavidHeffernan, The Gauss-Jordan elimination primarily. It can be unstable with double precision in some cases. – LU RD May 15 '14 at 08:31
  • @LURD To my mind the way to deal with that is to avoid Gauss-Jordan and use an algorithm that does not suffer from that instability. Or is there some reason why you have to use Gauss-Jordan? – David Heffernan May 15 '14 at 08:33
  • @DavidHeffernan, that code was written almost 30 years ago. Haven't crossed my mind that a better method was available. – LU RD May 15 '14 at 08:47
  • @LURD Most likely there is a better solution. Can't say what it is since I don't know the problem. – David Heffernan May 15 '14 at 08:54

2 Answers


Required reading: What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg.

The issue is not one of precision. Rather, it is one of representability. First of all, let us recap that floating-point numbers are used to represent real numbers. There are infinitely many real numbers. Of course, the same can be said of integers. But the difference here is that within a particular range there are finitely many integers yet infinitely many real numbers. Indeed, as Cantor originally showed, any finite interval of real numbers contains an uncountable number of real values.

So it is clear that we cannot represent all real numbers on a finite machine. So, which numbers can we represent? Well, that depends on the data type. Delphi floating point data types use binary representation. The single (32 bit) and double (64 bit) types adhere to the IEEE-754 standard. The extended (80 bit) type is an Intel specific type. In binary floating point, a representable number has the form k·2^n where k and n are integers. Note that I am not claiming that all numbers of this form are representable. That is not possible because there are an infinite quantity of such numbers. Rather, my point is that all representable numbers are of this form.
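The k·2^n form is easy to see from a float's hexadecimal representation. The question is about Delphi, but Python's float is the same IEEE-754 double format as Delphi's Double, so a quick sketch in Python illustrates the point (the hex strings shown are for doubles, not for Single or Extended):

```python
# Python floats are IEEE-754 doubles, the same layout as Delphi's Double,
# so float.hex() exposes the significand-times-power-of-two structure.

# 0.75 is exactly representable: 1.5 * 2**-1
print((0.75).hex())   # 0x1.8000000000000p-1

# 3.7 has no finite binary expansion, so the stored double is merely the
# nearest representable neighbour (note the run of hex 9 digits).
print((3.7).hex())    # 0x1.d99999999999ap+1

# Round-tripping through hex recovers the stored value exactly.
assert float.fromhex((3.7).hex()) == 3.7
```

The repeating 9s in the second hex string are the binary analogue of a repeating decimal: 3.7 recurs forever in base 2, so it must be cut off somewhere.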

Some examples of representable binary floating point numbers include: 1, 0.5, 0.25, 0.75, 1.25, 0.125, 0.375. Your value, 3.7, is not representable as a binary floating point value.
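You can verify this yourself. The following Python sketch prints the values that are actually stored: Python's float is an IEEE-754 double, and a Delphi Single can be simulated by round-tripping through a 32-bit float with the struct module.

```python
from decimal import Decimal
import struct

# Exact decimal expansion of the double closest to 3.7:
print(Decimal(3.7))
# 3.70000000000000017763568394002504646778106689453125

# Simulate Delphi's Single by rounding 3.7 to 32-bit float precision:
single = struct.unpack('<f', struct.pack('<f', 3.7))[0]
print(Decimal(single))
# 3.7000000476837158203125
```

Neither stored value is 3.7; each is simply the nearest number of the form k·2^n that fits in the type.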

What this means in relation to your code is that none of it is doing what you expect it to do. You are hoping to compare against the value 3.7. But instead you are comparing against the nearest exactly representable value to 3.7. As a matter of implementation detail, this nearest exactly representable value is in the context of extended precision, which is why the version using extended appears to do what you expect. However, do not take this to mean that your variable x is equal to 3.7. In fact it is equal to the nearest representable extended precision value to 3.7.

Rob Kennedy's most useful website can show you the closest representable values to a specific number. In the case of 3.7 these are:

3.7 = + 3.70000 00000 00000 00004 33680 86899 42017 73602 98112 03479 76684 57031 25
3.7 = + 3.70000 00000 00000 17763 56839 40025 04646 77810 66894 53125
3.7 = + 3.70000 00476 83715 82031 25

These are presented in the order extended, double, single. In other words these are the values of your variables x, d and s respectively.

If you look at these values and compare them with the closest extended value to 3.7, you will see why your program produces the output that it does. Both the single and double precision values here are greater than the extended one, which is what your program told you.
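The mixed-precision comparison is easy to reproduce outside Delphi. Here is a sketch in Python, where the literal 3.7 is a double rather than Delphi's extended, but the effect is the same: the value rounded to single precision lands above the more precise one.

```python
import struct

# Round 3.7 to single precision, as Delphi does when assigning to a Single.
s = struct.unpack('<f', struct.pack('<f', 3.7))[0]

# Compare the widened single against the (more precise) double literal:
print(s > 3.7)    # True  -- like "s > 3.7" in the question
print(s == 3.7)   # False
print(s < 3.7)    # False
```

The comparison itself is exact; it is the two roundings of 3.7, to different precisions, that produce two different numbers.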

I don't want to make any blanket recommendations as to how to compare floating point values. The best way to do that always depends very critically on the specific problem. No blanket advice can be usefully given.

David Heffernan

Short answer: 3.7 cannot be represented exactly (binary floating point values are always fractions with a denominator that is a power of 2); the precision of the data type you're storing it in (and the one the compiler selects for the type of the constant you're comparing it to) can affect the representation of that number and have an effect on the comparison.
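That power-of-two-denominator rule can be checked directly. A small illustration in Python, whose float is the same IEEE-754 double format:

```python
from fractions import Fraction

# 3.75 = 15/4: the denominator is a power of two, so it is stored exactly.
print(Fraction(3.75))          # 15/4

# 3.7 = 37/10: the factor of 5 in the denominator has no finite binary
# expansion, so the stored double is a nearby fraction over 2**50.
f = Fraction(3.7)
print(f.denominator == 2**50)  # True
print(f == Fraction(37, 10))   # False: only an approximation survives
```

Fraction(float) recovers the exact stored value, so the denominator it reports is always a power of two; 37/10 simply is not among the representable values.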

Moral: Never directly compare two floating point values for equality unless they're exactly the same data type and assigned the same exact value.

Obligatory link: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Another link that might be helpful is to Delphi's Math.SameValue function, which allows you to compare two floating point values for approximate equality within a specified allowable delta (difference).
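For readers outside Delphi, the same idea (compare within a tolerance rather than for exact equality) looks like this in Python. The same_value helper below is a simplified hypothetical sketch, not Delphi's actual SameValue, which by default derives its epsilon from the operands' magnitude.

```python
import math

a = 0.1 + 0.2
print(a == 0.3)              # False: binary rounding error accumulates
print(math.isclose(a, 0.3))  # True: equal within a relative tolerance

# A simplified absolute-epsilon comparison in the spirit of SameValue.
# (Hypothetical helper, for illustration only.)
def same_value(x, y, epsilon=1e-9):
    return abs(x - y) <= epsilon

print(same_value(a, 0.3))    # True
```

A fixed absolute epsilon is only appropriate when you know the magnitude of the values involved; for general-purpose code a relative tolerance, as math.isclose uses, is usually the safer choice.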

Ken White
  • It's not that the number is not even. Representable in binary floating point means being of the form k/2^n. Other than that I think you've got it spot on. – David Heffernan May 15 '14 at 04:26
  • @David: Thanks for the edit. Regarding the point you make, do you have a suggestion for phrasing it better that would be understandable by non-mathematicians (and new programmers) in plain language? – Ken White May 15 '14 at 12:33
  • I'm not sure there's really any way to avoid at least a little maths. Perhaps the closest might be to say the binary floating point values are always fractions with denominator that is a power of 2. – David Heffernan May 15 '14 at 12:37
  • @David: That should work. I've edited accordingly (I couldn't quite come up with a better term). Thanks for the assist. (Credits given in this comment and the revision notes.) – Ken White May 15 '14 at 13:29
  • +1, This is a very good answer too. Thanks for your expertise and the links to the articles. – James L. May 15 '14 at 22:51