
Question: what is the highest value that can be accurately represented to one decimal place by IEEE-754 32-bit floating point data?

Background: I've found this question which asks: Which is the first integer that an IEEE 754 float is incapable of representing exactly?

...and that all makes sense, but I'm not sure how to translate the method given there to my question.

My application is: I'm writing a totaliser function which totalises weights to one decimal place, storing them in a 32-bit float. At some point, if it is not reset, this totaliser will begin to lose accuracy. I want to determine what that point is so I can either alert the user that the totaliser is no longer accurate or automatically reset it.

ASForrest
    There is really no such thing as one decimal place in floating-point. There are *binary* places, and they are incommensurable with decimal places. Your question doesn't actually have an answer. – user207421 Jun 18 '18 at 01:56
    Are you asking about exact representation, as in the integer question? The only one decimal place numbers in [0,1) that can be represented exactly are 0.0 and 0.5. – Patricia Shanahan Jun 18 '18 at 02:45
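
The point about exact representation can be checked directly. The following is a minimal sketch, not part of the original comment: `to_float32` is a hypothetical helper that rounds a Python float to the nearest 32-bit value via the standard `struct` module, and `Fraction` is used to compare the stored value against the exact tenth.

```python
import struct
from fractions import Fraction


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def exactly_representable_tenths():
    """Return the one-decimal-place values in [0, 1) that binary32 stores exactly."""
    exact = []
    for k in range(10):
        stored = to_float32(k / 10)
        # Fraction(stored) is the exact rational value held in the float.
        if Fraction(stored) == Fraction(k, 10):
            exact.append(f"0.{k}")
    return exact


print(exactly_representable_tenths())  # ['0.0', '0.5']
```

Every other tenth is stored as the nearest available binary fraction, which is why the question has to be about rounding back to one decimal place rather than exact storage.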

2 Answers


Based on the description of the function you want to write, the question you intend to ask seems to be:

What is the largest x such that, for any list L of numbers whose sum does not exceed x and each of which can be written as a positive decimal numeral with one digit after the decimal point, the 32-bit binary floating-point sum of the numbers, when converted to a decimal numeral with a single digit after the decimal point, equals the exact sum of the numbers?

We could calculate x. However, that is the wrong approach for the function you want to write. A better approach is to take each weight, multiply it by ten to produce an integer, and then accumulate the sum of those integers. That sum could be accumulated with integer arithmetic, although floating-point would suffice up to the point where integers can no longer be represented exactly.

My intuition is this would allow accumulating to a higher limit than the first approach, as it incurs no rounding error and so can continue to use the full significand of the floating-point format (if floating-point is used for accumulation), whereas the first approach incurs rounding errors and so may fail sooner.
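
As a rough illustration of the scaled-integer approach, here is a minimal Python sketch. The class name `TenthsTotaliser`, its methods, and the constant name are hypothetical, and it assumes non-negative weights that arrive as ordinary floats with one decimal place. The limit of 2^24 is the largest range over which a 32-bit float represents every integer exactly; it only matters if the count of tenths is itself kept in a 32-bit float.

```python
# Integers up to 2**24 (16,777,216) are all exactly representable in binary32.
FLOAT32_EXACT_INT_LIMIT = 2**24


class TenthsTotaliser:
    """Hypothetical totaliser that stores the running total as a whole number of tenths."""

    def __init__(self):
        self.total_tenths = 0

    def add(self, weight):
        # Scale to tenths and round to the nearest integer, so the tiny
        # binary representation error in the weight is discarded immediately.
        self.total_tenths += round(weight * 10)

    def exceeds_float32_exact_range(self):
        # Relevant only if total_tenths were held in a 32-bit float.
        return self.total_tenths > FLOAT32_EXACT_INT_LIMIT

    def display(self):
        # Format from the integer count; assumes a non-negative total.
        whole, tenth = divmod(self.total_tenths, 10)
        return f"{whole}.{tenth}"


totaliser = TenthsTotaliser()
for _ in range(10):
    totaliser.add(0.1)
print(totaliser.display())  # 1.0
```

Because the rounding happens once per weight instead of accumulating in the total, the displayed value stays exact for as long as the integer count itself is exact.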

Eric Postpischil
  • Thanks for your answer. I'm an industrial machine programmer, not a computer scientist or a mathematician, so the fine detail of the math is a little over my head, but I understand your rephrasing of the question, and yes, that's what I'm after. The suggestion to do it with integer arithmetic is a valid one and I may go this way. – ASForrest Jun 18 '18 at 04:09
1

I ran a test as follows:

Set Test_INT to 0
Set Test_FLOAT to 0
Set Counter to 0
Set STOP to False

While Not STOP (
1. Increment Counter by 1
2. Divide Counter by 10, store result in Test_FLOAT
3. Multiply Test_FLOAT by 10, store result in Test_INT
4. If Test_INT <> Counter, set STOP to True
)

The idea was that each time the counter increments, I divide it by 10, store it in a float, and then multiply it by 10. If the float was able to correctly display the value to 1 DP, then the multiplication will result in the same value as before the divide, and the loop will continue. If the float has to round up or down, the value when multiplied by 10 will be different, halting the loop.

The result was that the loop halted with an integer value of 10485763. Confirming this, I entered 1048576.2 into the float register, and it immediately updated to show 1048576.3.

According to this test, my answer would therefore be 1048576.1.
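
The behaviour around the halting point can be reproduced off the controller. This is a minimal Python sketch under stated assumptions: `to_float32` is a hypothetical helper that emulates 32-bit storage with the standard `struct` module, and the conversion back to an integer rounds to nearest, which may differ from what a given controller does. It checks only the three counter values around the reported result rather than re-running the whole loop.

```python
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def survives_round_trip(counter):
    """Divide by 10 into a 32-bit float, multiply back, and compare with the counter."""
    stored = to_float32(counter / 10)
    restored = to_float32(stored * 10)
    return round(restored) == counter


# Above 2**20 = 1048576 adjacent binary32 values are 0.125 apart,
# so 1048576.2 and 1048576.3 are both stored as 1048576.25.
print(to_float32(1048576.2))  # 1048576.25
print(to_float32(1048576.3))  # 1048576.25

for counter in (10485761, 10485762, 10485763):
    print(counter, survives_round_trip(counter))
```

With a spacing of 0.125, which is larger than 0.1, two neighbouring one-decimal values can land on the same stored float, and that is the collision the loop detects.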

ASForrest
    Consider a set of values *x0*, *x1*,… *xn* that are each one-tenth of an integer and that, when converted to `float`, yield a result that is slightly less than the original value. Although each may convert back to the original value because it is sufficiently close, when they are added together, the total error may produce a sum small enough that, when converted back from `float` to a decimal numeral, it is less than the proper sum. – Eric Postpischil Jun 18 '18 at 11:44
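
The accumulation effect described in this comment can be explored with a short experiment. This is a minimal Python sketch, not a definitive test: `to_float32` and `first_divergence` are hypothetical names, 32-bit arithmetic is emulated by rounding after every addition, and the weight is fixed at 0.1 purely for illustration. The per-addition error can go in either direction depending on the weight and the size of the running total, but the mechanism is the same.

```python
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def first_divergence(weight_tenths=1, max_additions=1_000_000):
    """Add the same weight repeatedly in emulated binary32.

    Returns (n, total) for the first n at which the float total, rounded to
    one decimal place, no longer matches the exact total, or None if that
    does not happen within max_additions.
    """
    step = to_float32(weight_tenths / 10)
    total = 0.0
    for n in range(1, max_additions + 1):
        total = to_float32(total + step)  # each addition is rounded to binary32
        if round(total * 10) != n * weight_tenths:
            return n, total
    return None


result = first_divergence()
if result is not None:
    n, total = result
    print(f"after {n} additions the float total is {total}, exact total is {n / 10}")
```

In this setup the total goes wrong far below the limit found by the single-value round-trip test, which is why that test gives an upper bound on the usable range rather than a guarantee for an accumulating total.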