
Consider 64-bit (8-byte, double-precision) IEEE floats that represent exact numbers (not approximations).

I'm using JS and have access to the mantissa and exponent.

I'm interested in the correct general use of floating-point arithmetic (addition, subtraction, multiplication, division).

Is the following correct about floats:

  • You can only represent exact decimal numbers in binary floats where the denominator is a power of 2
  • It only makes sense to do calculations when both numbers are in the same "window of precision" or scale. E.g. very small numbers calculated with other very small numbers? Never very large numbers and very small numbers?

So how do I programmatically detect that the numbers are in the same window of precision/scale?

Would it be possible to find the exact number halfway between a very large and a very small number, or is that calculation stuck at (large/2) because ((large+small)/2) cancels out the smaller number?

zino
  • JavaScript is not a good language for this. What are you really trying to do? – Eric Postpischil Nov 28 '17 at 12:55
  • I'm trying to allow inserting an item at any point in a SQLite database table that represents an ordered list. I want to be able to say newItem.order = ((prevItem.order+nextItem.order)/2), i.e. "insert between these two". I cannot update all the item orders, as the table could be 2M rows, which takes 15s. I'm considering using a float, but the issue is that the items can also be moved to new positions, so after many moves prevItem could be a very small number compared to nextItem. Is there a better way to achieve this? Thanks! – zino Nov 28 '17 at 13:15
  • It seems you are trying to use floating-point as a sort key so that you can easily insert new keys, such as being able to put 3.5 between 3 and 4, and 3.75 between 3.5 and 4, and 3.625 between 3.5 and 3.75, as new items are inserted. In this case, you might as well simply use integers. Number the original items something like 1000000, 2000000, 3000000, and so on, then insert 3500000, 3750000, 3625000, and so on. Of course, this may run out of bits on the high or the low end if unfortunate insertions arrive, but you have to deal with that whether you use integer or floating-point. – Eric Postpischil Nov 28 '17 at 13:24
  • "how do I programatically detect the numbers are in the same window of precision/scale?" --> `if (a < b && b < a*2)` is a good start. – chux - Reinstate Monica Nov 28 '17 at 14:54

1 Answer


You can only represent exact decimal numbers in binary floats where the denominator is a power of 2

Yep! This follows trivially from the definition of a floating-point number. Of course, not all numbers with power-of-2 denominators are exactly representable as binary64 floats.

It only makes sense to do calculations when both numbers are in the same "window of precision" or scale. E.g. very small numbers calculated with other very small numbers? Never very large numbers and very small numbers?

When specifically considering addition and subtraction, this is roughly true. In particular I refer you to Sterbenz's Theorem (often called the Sterbenz lemma), which is roughly: if you subtract one number from another and the two are within a factor of 2 of each other, the result is exact. However, that's only a sufficient condition, not a necessary one. More generally, addition is exact in situations where ((a+b) - a) - b, evaluated in a floating-point context without extended precision and with a being the operand of larger magnitude, produces zero. See Kahan's summation algorithm for more about this sort of thing.
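As a rough illustration of the residual test and of Sterbenz's condition, here is a minimal sketch. The names addIsExact and withinFactorOfTwo are made up for this example, and it assumes finite inputs and no overflow.

```javascript
// Hypothetical helper, not from any library: reports whether a + b is
// computed without rounding. Assumes finite inputs and no overflow.
function addIsExact(a, b) {
  // The residual test is only reliable with the larger magnitude first.
  if (Math.abs(a) < Math.abs(b)) {
    [a, b] = [b, a];
  }
  const s = a + b;           // rounded sum
  return (s - a) - b === 0;  // zero residual means nothing was lost
}

// Sterbenz's sufficient condition for an exact a - b, for positive inputs.
function withinFactorOfTwo(a, b) {
  return b / 2 <= a && a <= 2 * b;
}

console.log(addIsExact(0.5, 0.25));        // true
console.log(addIsExact(2 ** 60, 1));       // false: the 1 is absorbed
console.log(withinFactorOfTwo(0.3, 0.2));  // true
console.log(addIsExact(0.3, -0.2));        // true: 0.3 - 0.2 is exact
```

This also bears on the midpoint question: (2 ** 60 + 1) / 2 evaluates to exactly 2 ** 59, because the 1 is already lost in the addition.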

For multiplication, it's actually pretty easy to check for error. Right shift each significand until its LSB is one, then multiply the two shifted significands, and if the result is 2^53 or greater, there's rounding. You can do the same sort of significand-based "trial operation" with addition/subtraction, but first you need to shift the larger significand by the difference in exponent (and you then need to right shift both significands until one has a 1 in the LSB).
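The multiplication check could be sketched roughly as follows. The helper names oddSignificand and mulIsExact are hypothetical, the significands are handled as BigInt so the product cannot itself round, and overflow and underflow of the result are ignored, as in the rest of this answer.

```javascript
// Hypothetical helper: write a finite, nonzero double as m * 2^e with m an
// odd BigInt, i.e. the significand right-shifted until its LSB is 1.
function oddSignificand(x) {
  if (!Number.isFinite(x) || x === 0) {
    throw new RangeError("expected a finite, nonzero number");
  }
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, Math.abs(x));
  const bits = view.getBigUint64(0);
  const biased = Number(bits >> 52n);  // exponent field (sign bit is 0)
  let m = bits & ((1n << 52n) - 1n);   // 52-bit fraction field
  let e = -1074;                       // weight of the lowest bit for subnormals
  if (biased !== 0) {
    m |= 1n << 52n;                    // restore the implicit leading 1
    e = biased - 1075;
  }
  while ((m & 1n) === 0n) {            // shift until the LSB is 1
    m >>= 1n;
    e += 1;
  }
  return { m, e };
}

// Hypothetical helper: true if a * b needs no rounding
// (ignoring overflow and underflow of the result).
function mulIsExact(a, b) {
  if (a === 0 || b === 0) return true;
  const product = oddSignificand(a).m * oddSignificand(b).m;
  return product < (1n << 53n);        // must fit in 53 significand bits
}

console.log(mulIsExact(3, 5));         // true
console.log(mulIsExact(0.1, 0.1));     // false
```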

For division, it's even easier. You've noted that only rational numbers with a power-of-2 denominator are exactly representable. So if you ever divide a nonzero number by a number which is not a power of two (that is, which has a nonzero mantissa), there's rounding, unless you can exactly divide the right-shifted significands as integers.
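A sketch of the division check along the same lines: once both significands are reduced to odd integers, the quotient is exact precisely when the divisor's significand divides the dividend's. The names oddPart and divIsExact are hypothetical, and the sketch assumes finite inputs, a nonzero divisor, and a quotient that neither overflows nor underflows.

```javascript
// Hypothetical helper: the significand of a finite, nonzero double as an
// odd BigInt (all trailing zero bits shifted out).
function oddPart(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, Math.abs(x));
  const bits = view.getBigUint64(0);
  let m = bits & ((1n << 52n) - 1n);   // 52-bit fraction field
  if ((bits >> 52n) !== 0n) {
    m |= 1n << 52n;                    // normal number: implicit leading 1
  }
  while ((m & 1n) === 0n) {
    m >>= 1n;
  }
  return m;
}

// Hypothetical helper: true if a / b needs no rounding
// (ignoring overflow and underflow of the quotient).
function divIsExact(a, b) {
  if (!Number.isFinite(a) || !Number.isFinite(b) || b === 0) {
    throw new RangeError("expected finite a and finite, nonzero b");
  }
  if (a === 0) return true;
  // With both significands odd, the quotient has a power-of-2 denominator
  // only if the division of the significands leaves no remainder.
  return oddPart(a) % oddPart(b) === 0n;
}

console.log(divIsExact(6, 3));   // true
console.log(divIsExact(3, 6));   // true: the result is 0.5
console.log(divIsExact(1, 3));   // false
```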

Note that in all of this I've been ignoring underflow and overflow, and sweeping denormalized inputs under the rug. If you want to do this correctly, you really need to get a good book on floating point calculation. I can recommend Higham's "Accuracy and Stability of Numerical Algorithms".

Sneftel
  • Another possibility without access to hardware inexact flag, is to evaluate fma(a,b,-a*b)==0 => multiplication is exact, and fma(b,a/b,-a)==0 => division is exact, unfortunately it seems that JS does not have a cheap FusedMultiplyAdd, only emulated ones... – aka.nice Nov 28 '17 at 13:54
  • And implication may not be true in case of gradual underflow, we would have to pre-scale a and b... – aka.nice Nov 28 '17 at 13:58
  • @aka.nice Clever! – Sneftel Nov 28 '17 at 14:57