How can I avoid this float number rounding issue in C++?

Question

With the code below, I get the result "4.31 43099".

  double f = atof("4.31");
  long ff = f * 10000L;
  std::cout << f << ' ' << ff << '\n';

If I change "double f" to "float f", I get the expected result "4.31 43100". I am not sure whether changing "double" to "float" is a good solution. Is there any good solution to ensure I get "43100"?

`Is there any good solution to assure I get "43100"?` Your application only uses the number `4.31`? What about other floating point numbers? — PaulMcKenzie, Jun 20 '14 at 15:36

bames53 · Answer 1 · 2014-06-20T23:00:04.960

You're not going to be able to eliminate the errors in floating-point arithmetic (though with proper analysis you can calculate the error). For casual usage, one thing you can do to get more intuitive results is to replace the built-in float-to-integral conversion (which truncates) with normal rounding:

  double f = atof("4.31");
  long ff = std::round(f * 10000L);
  std::cout << f << ' ' << ff << '\n';

This should output what you expect: 4.31 43100

Also, there's no point in using 10000L: whatever integral type you use, it still gets converted to f's floating-point type for the multiplication. Just use std::round(f * 10000.0);

score 2 · Accepted Answer · answered Jun 20 '14 at 15:37

The problem is that floating point is inexact by nature when talking about decimal numbers. A decimal number can be rounded either up or down when converted to binary, depending on which value is closest.

In this case you just want to make sure that if the number was rounded down, it's rounded up instead. You do this by adding the smallest amount possible to the value, which is done with the nextafter function if you have C++11:

  long ff = std::nextafter(f, 1.1*f) * 10000L;

If you don't have nextafter you can approximate it with numeric_limits.

  long ff = (f * (1.0 + std::numeric_limits<double>::epsilon())) * 10000L;

I just saw your comment that you only use 4 decimal places, so this would be simpler but less robust:

  long ff = (f * 1.0000001) * 10000L;

answered Jun 20 '14 at 15:37

Mark Ransom

Thank you. I love your solutions. – poordeveloper Jun 20 '14 at 15:51

Replacing the built-in truncation with a call to `std::round()` is simpler than trying to manipulate rounding this way. – bames53 Jun 20 '14 at 16:21
@bames53 my method is more correct if truncation is really what you want. Given that the number is limited to 4 digits though your answer gives the same results and it is simpler. – Mark Ransom Jun 20 '14 at 17:05
`1.1*f` has to be the worst possible way to indicate “up” to `nextafter`. It does not work for the 10 or so denormals around zero, including zero itself. It does the opposite of what it should for a negative `f`. And it is likely to be compiled into a multiplication. This is what `infinity()` is for (and if you are worried that the compilation platform may not have an infinity, what do you expect will happen when you multiply the largest finite numbers by 1.1?) – Pascal Cuoq Jun 22 '14 at 16:55
@PascalCuoq for negative numbers conversion to `int` will truncate towards zero, so you want `nextafter` to increment away from zero. Feel free to propose an alternate answer. – Mark Ransom Jun 23 '14 at 02:49

Eugene Podskal · Answer 3 · 2014-06-20T15:34:47.760

With the standard C types, I doubt it.

Many values cannot be represented in the available bits; they would need more space to be stored exactly. So the floating-point unit just uses the closest representable value.

Floating-point numbers cannot store every value you might expect. There is only a limited number of bits: you can't put more than about 4 billion different values in 32 bits. And that's just the first restriction.

Floating-point values (in C) are represented by three fields: sign (one sign bit), power (the bits that define the power of two for the number), and significand (the bits that actually make up the number).

Your actual number is sign * significand * 2^(power - bias).

On typical (IEEE 754) platforms, a double has 1 sign bit, 11 bits of power (stored with a bias so that the field is non-negative, but that is not the point) and 52 bits for the significand.

That is a lot, but not enough to represent all values, especially those that cannot be written as a finite sum of powers of two, such as binary 1010.101101(101). For example, it cannot precisely represent values like 1/3 = 0.333333(3). That's the second restriction.

Try reading these; a decent understanding of the advantages and disadvantages of floating-point arithmetic can be very handy: http://en.wikipedia.org/wiki/Floating_point and http://homepage.cs.uiowa.edu/~atkinson/m170.dir/overton.pdf

score 0 · Answer 4 · answered Jun 20 '14 at 15:49

There have been some confused answers here! What is happening is this: 4.31 can't be exactly represented as either a single- or double-precision number. On IEEE 754 hardware, the nearest representable value is a little less than 4.31 in both formats (about 4.30999994 in single precision and about 4.3099999999999996 in double precision). When a floating-point value is assigned to an integer variable, it is rounded towards zero (not towards the nearest integer!).

So if f is double-precision, f * 10000L comes out slightly less than 43100, and it is rounded down to 43099. If f is single-precision, the exact product would also be slightly less than 43100, but the multiplication result is itself rounded to the nearest single-precision value (assuming the product is evaluated in single precision), and at that magnitude the nearest value is exactly 43100, so the conversion yields 43100.

The comment by n.m. suggests f * 10000L + 0.5, which I think is the best solution.
