
So I have been trying to wrap my head around the relationship between the number of significant digits in a floating point number and the relative loss of precision, but I just can't seem to make sense of it. I was reading an article earlier that said to do the following (a short sketch reproducing these steps follows the list):

  1. Set a float to a value of 2147483647. You will see that its value is actually 2147483648
  2. Subtract 64 from the float and you will see that the operation is correct
  3. Subtract 65 from the float and you will see that you actually now have 2147483520, meaning that it actually subtracted 128.
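A minimal C sketch of those three steps (my own, not from the article; it assumes `float` is IEEE-754 binary32 with the default round-to-nearest mode):

```c
#include <stdio.h>

int main(void)
{
    /* 2147483647 (2^31 - 1) needs 31 significant bits, more than the 24
       a binary32 float can hold, so it rounds up to 2^31 = 2147483648. */
    float f = 2147483647.0f;
    float a = f - 64.0f;   /* exact result 2147483584 falls exactly halfway
                              between two adjacent floats that are 128 apart */
    float b = f - 65.0f;   /* exact result 2147483583 rounds to the nearest
                              float, 2147483520, i.e. 128 was subtracted     */
    printf("%.1f\n%.1f\n%.1f\n", f, a, b);
    return 0;
}
```

Exactly what step 2 prints can depend on whether the subtraction is carried out in excess intermediate precision; see the last comment below.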

So why is this 128 when there are 10 significant digits? I understand how floats are stored (1 bit for sign, 8 bits for exponent, 23 bits for mantissa) and understand how you will lose precision if you assume that all integers will automatically find exact homes in a float data structure, but I don't understand where the 128 comes from. My intuition tells me that I'm on the right track, but I'm hoping that someone may be able to clear this up for me.

I initially thought that the distance between possible floats was `2^(n-1)`, where n was the number of significant digits, but this did not hold true.

Thank you!

MoarCodePlz
  • 23 bits of mantissa isn't 10 digits, it's slightly less – Martin Beckett Aug 10 '11 at 04:25
  •
    23 bits mean `23 * ln(2) / ln(10)` decimal digits ~ `6.92` digits. That's a bit more than "slightly" less. Makes sense, if you know that 2147483647 = 2^31 - 1. – Rudy Velthuis Aug 10 '11 at 19:56
  • Oops, `24 * ln(2) / ln(10) ~ 7.225`, so 7 digits. – Rudy Velthuis Aug 10 '11 at 22:47
  • Your example works with values near 0x80000000, which take 31 significant bits, or 7 bits more than the 24 that can be stored in a mantissa. Thus you see the value "quantized" to the nearest multiple of 128. In your example it rounds down (subtracting 65 has the same effect as subtracting 128), though it's a detail which depends on the assembly code. Internally the FPU uses an 80-bit representation instead of 32-bit when operating on a single value, to avoid precision loss. But the MSVC compiler in "fast" floating point model will use SSE code which is strictly 32 bits and can exhibit some loss. – MichaelsonBritt Apr 19 '18 at 05:06
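A quick check of the digit arithmetic worked out in the comments above (a sketch in plain C):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* A 24-bit significand carries 24 * log10(2) decimal digits of precision. */
    printf("%.3f\n", 24.0 * log10(2.0));   /* prints 7.225 */
    return 0;
}
```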

2 Answers


The distance between two adjacent floating point numbers depends on the exponent: the smaller the exponent, the smaller the gap from one floating point number to the next. The next thing to consider is that the stored exponent is a binary exponent, not a decimal one, so what matters is the number's binary precision rather than its count of decimal digits. Figure 9.1 of this document explains the concept pretty well.
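One way to see this directly (my own sketch, assuming a C environment with IEEE-754 floats; `nextafterf` is standard in `<math.h>`) is to print the gap to the next representable float at a few magnitudes. The gap doubles each time the exponent grows by one:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    float xs[] = { 1.0f, 1024.0f, 1048576.0f, 2147483648.0f };
    for (int i = 0; i < 4; i++) {
        /* Gap between xs[i] and the next representable float toward +infinity. */
        printf("at %g the next float is %g away\n",
               xs[i], nextafterf(xs[i], INFINITY) - xs[i]);
    }
    return 0;
}
```

At 1.0 the gap is 2^-23 (about 1.19e-07); at 2^31 it has grown to 256.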

Peter O.

The "distance" between two adjacent floating point numbers is 2^(1-n+e), where e is the true exponent and n the number of bits in the mantissa (AKA significand). The exponent stored is not the true exponent, it has a bias. For IEEE-754 floats, this is 127 (for normalized numbers). So, as Peter O said, the distance depends on the exponent.

Rudy Velthuis
  • I'm trying to understand the IEEE 754 format, and I have a couple of doubts. For IEEE 754 binary32, n = 24, so the distance should be 2^(e-23). Is this the resolution of the format for a given e? Is this also called the [ULP](http://en.wikipedia.org/wiki/Unit_in_the_last_place#cite_note-1) for the numbers within [2^e, 2^(e+1)]? Please correct me if I'm wrong. – legends2k Aug 28 '15 at 06:51
  • For normalized numbers, the distance is the difference one bit in the significand makes. This is 2^(e-23) for 32 bit floats, indeed. – Rudy Velthuis Aug 28 '15 at 08:27
  • Is this resolution also called the ULP? – legends2k Aug 28 '15 at 09:43
  • Yes, that is what is generally called the ULP (unit of least precision, IIRC). – Rudy Velthuis Aug 28 '15 at 10:01