
My code solves a quadratic equation (in the game logic tick) to find a satellite's tick offset along the orbit of a movable object in space, and I've encountered errors in the calculation of the discriminant (hereafter D). As a reminder: D = b^2 - 4ac. Since this is the orbit of a big object, my a, b & c are numbers on the order of:

a ~ 1E+8, b ~ 1E+12, c ~ 1E+16

Accordingly, b^2 is a number on the order of 1E+24, and 4ac is about 1E+24 too. BUT the roots of this equation are much smaller numbers, because they are just coordinates in the scene, so the roots are about 1E+3 ... 1E+4.

The problem (updated and made concrete): because of the limited precision of floats (and doubles), b^2 and 4ac each carry an error that is small relative to these very big numbers (the measured absolute error is on the order of 1E+18). BUT since D is the difference between them, once D comes down (from above) to the same order as that error (1E+18), its value starts to fluctuate in a range of roughly -1E+18 .. +1E+18, i.e. the fluctuation range is wider than [-100% .. +100%] of the actual value!
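
For scale: near 1E+25 the spacing between adjacent representable float values ("ulp") is about 1.2E+18, while for double it is about 2E+9, so the float ulp alone already explains the measured error band. A quick C++ check (added just for reference, not part of my game code):

#include <cfloat>
#include <cmath>
#include <cstdio>

int main (void)
{
    const float  xf = 1.0555e25f;   // magnitude of b^2 and 4ac
    const double xd = 1.0555e25;
    std::printf ("float  ulp near 1E+25: %g\n", (double)(std::nextafter (xf, FLT_MAX) - xf));
    std::printf ("double ulp near 1E+25: %g\n", std::nextafter (xd, DBL_MAX) - xd);
    return 0;
}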

Obviously, this fluctuation causes wrong (even wrongly directed) tick offsets, and my satellite starts to wobble (and it's awful).

Note: when I say "when D approaches zero", D is actually still far enough from zero that I can't just clamp it to zero in this range of values.

I've considered using fixed-point arithmetic (which could rescue me from my problem), but it is not recommended in tick logic (it is much less optimized and would probably be very slow).

My question: how can I try to solve this problem? Maybe there are some common solutions for my case? Thanks a lot for any advice!

PS: All the formulas are correct (I calculated everything in Excel and got proper results where the floats in my code failed).

PPS: I tried doubles instead of floats (not for all calculations, but my a, b & c are doubles now) and the problem did not disappear.

Update: I made a mistake and confused the orders of a, b & c, so "b^2 is a number on the order of about 1E+16, and 4ac is about 1E+28" was wrong. It is now fixed to 1E+24 for both. (I note this so that the already-written comments remain understandable.)

Update #2: the "The problem" section has been made more concrete.

Update #3: a real case of values (for reference). Note: "accurate values" here means values calculated manually in Excel.

a == 1.43963872E+8
b == 3.24884062357827E+12
c == 1.83291898112689E+16

//floats:
b^2 == 1.05549641E+25
4ac == 1.05549641E+25
D == 0.0
root:
y = -1.12835273E+4

//doubles:
b^2 == 1.0554965397412443E+25
4ac == 1.0554964543412880E+25
D == 8.5399956328598733E+17
roots:
y1 == -1.1280317962726038E+4
y2 == -1.1286737079932651E+4

//accurate values:
b^2 == 1.05549653974124E+25
4ac == 1.05549645434129E+25
D == 8.53999563285987E+17
roots: 
y1 == -1.128031796E+4 
y2 == -1.128673708E+4

It looks OK with doubles, but it isn't, because I've shown only part of the calculation: I start here from the same a, b & c values, but in my code those values are themselves calculated and contain inaccuracies, which causes problems even with doubles.
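
For reference, the cancellation itself is easy to reproduce in isolation with the values above. A quick C++ check (the exact float result may vary with compiler and floating-point settings):

#include <cstdio>

int main (void)
{
    const double ad = 1.43963872e8;
    const double bd = 3.24884062357827e12;
    const double cd = 1.83291898112689e16;
    const float  af = (float)ad, bf = (float)bd, cf = (float)cd;

    const float  Df = bf * bf - 4.0f * af * cf;   // here: 0.0 - complete cancellation
    const double Dd = bd * bd - 4.0 * ad * cd;    // ~8.54E+17, but low-order digits are already lost

    std::printf ("float  D = %g\n", Df);
    std::printf ("double D = %.17g\n", Dd);
    return 0;
}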

user1234567
  • If you're relying on accuracy you shouldn't be using `float` *at all*. – Mark Ransom Dec 11 '16 at 15:44
  • [Game physics for beginners](http://brm.io/game-physics-for-beginners/). – IInspectable Dec 11 '16 at 15:46
  • You need to adjust what it means to "approach zero". Here, you are subtracting two values on the order of 1E28 and getting results on the order of 1E18, which is a relative error of only 1E-10. It's the relative error you should be checking, not the absolute error. (Also: You should read up on numerical analysis, which deals with problems like this.) – Raymond Chen Dec 11 '16 at 16:00
  • Fixed-point is not magical. You only think it would help you because you have not tried it. – Pascal Cuoq Dec 11 '16 at 22:00
  • "b^2 is number of order about 1E+16" shouldn't it be of order 1E+24? – QuarterlyQuotaOfQuotes Dec 12 '16 at 00:26
  • @PascalCuoq, I think that with my own number-type class I'd be able to provide enough range for the integral part (my intermediate results have a max order of about 1E+28, so two int64s would give me more than enough, about 1E+38 [actually, I could "scale" the problem before the big-order calculation starts to avoid needing the 2nd of those two int64s]) and also enough precision for the fractional part (even, say, about 1E-3 would be enough for my task). Where did I go wrong in my expectations? – user1234567 Dec 12 '16 at 08:37
  • @QQoQ, indeed I had confused the orders of a, b, c. Fixed. Thank you, it was an important remark (1E+16 & 1E+28 have now both become 1E+24). – user1234567 Dec 12 '16 at 09:18
  • @MarkRansom, if you mean my calculations should use doubles from the very first operation, then the question is: is there any sense in using doubles at all for me, if the game engine can only give me object positions as floats (which are the base values for my further calculations)? – user1234567 Dec 12 '16 at 10:49
  • @RaymondChen, by "`D` is approaching zero" I mean that the roots become closer. Actually, the roots of this quadratic equation are one of the two 2D coordinates of the two intersection points of two circles (one circle is the orbit, the other is my auxiliary geometry). Don't be confused by "2D" appearing in a 3D task - I solve this subtask in the plane of the orbit. – user1234567 Dec 12 '16 at 10:50
  • I answered this one there : http://stackoverflow.com/questions/4503849/quadratic-equation-in-ada – Alexandre C. Dec 12 '16 at 11:27
  • It would be helpful if you gave some explicit examples of troublesome `a`, `b` and `c` in the question. – Mark Dickinson Dec 12 '16 at 13:49
  • @MarkDickinson, I've added "Update #3" at the end of the post with a real case. – user1234567 Dec 12 '16 at 17:05
  • @user3241228: Thanks. With those values, it looks as though `b*b-4*a*c` is actually negative, so the corresponding quadratic polynomial has no roots. – Mark Dickinson Dec 12 '16 at 18:19
  • @user3241228 With the data shown, I get a negative discriminant: `b*b=1.0554961345600001e+25 4*a*c=1.0554969523516320e+25 d=-8.17791632e+18`. I would suggest giving the numbers according to the format `%23.16e` at minimum, or use the `a` format to print them as hexadecimal floating-point numbers. – njuffa Dec 12 '16 at 18:24
  • @MarkDickinson, njuffa, sorry, I mixed up some slightly different logged results. I've now fixed "Update #3". – user1234567 Dec 12 '16 at 19:38
  • @user3241228 Now the numbers work, but your problem remains *ill-conditioned*, and accurate computation cannot fix that. Using the code from my answer and `double` computation, I get: `d=8.5399956208164339e+17; x1=-1.1280317962728303e+4; x2=-1.1286737079930386e+4;`. This is very close to the most accurate solutions actually representable as a `double`, which are: `x1=-1.1280317962728301e+4; x2=-1.1286737079930388e+4` – njuffa Dec 12 '16 at 20:32

3 Answers


Using the standard quadratic formula can give "catastrophic cancellation", where the subtraction of two numbers of similar magnitude causes a loss of precision.

The trick is to use an alternative formulation in such cases; see here: https://math.stackexchange.com/a/311397

UPDATE: I misread your question. I think the problem is more likely to be the sensitivity of your result to the input numbers. Let's pick, say

a = 4e8
b = -1e12
c = 6.2e14

for which the solutions are ~1138 and ~1361. Now compute the relative derivatives; I can do this in Julia by automatic differentiation, using the ForwardDiff.jl package:

julia> import ForwardDiff.Dual

julia> function p(a,b,c)
    D = sqrt(b^2-4*a*c)
    (-b+D)/(2a), (-b-D)/(2a)
end

julia> p(a,Dual(b,b),c)
(Dual(1361.803398874989,15225.424859373757),Dual(1138.196601125011,-12725.424859373757))

julia> p(Dual(a,a),b,c)
(Dual(1361.803398874989,-8293.614129124373),Dual(1138.196601125011,5793.614129124373))

julia> p(a,b,Dual(c,c))
(Dual(1361.803398874989,-6931.8107302493845),Dual(1138.196601125011,6931.8107302493845))

The results here are the two solutions and their scaled derivatives (i.e. (df/dx)*x). Note that these are all of order O(10000), so a relative error of eps in an input shifts a root by roughly 10000*eps in absolute terms; since the roots themselves are only of order 1000, relative errors in the inputs are amplified by roughly a factor of ten in the roots (and the closer the two roots are to each other, the worse this amplification becomes).
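
The same check can be done without automatic differentiation, by brute force: perturb one coefficient slightly and watch how far the roots move. A minimal C++ sketch (my own illustration, using the example coefficients above):

#include <cmath>
#include <cstdio>

// Naive quadratic roots; good enough here, since the point is input
// sensitivity rather than evaluation error.
static void roots (double a, double b, double c, double &x1, double &x2)
{
    double s = std::sqrt (b * b - 4.0 * a * c);
    x1 = (-b + s) / (2.0 * a);
    x2 = (-b - s) / (2.0 * a);
}

int main (void)
{
    const double a = 4e8, b = -1e12, c = 6.2e14;
    const double eps = 1e-6;   // relative perturbation of b

    double x1, x2, y1, y2;
    roots (a, b, c, x1, x2);
    roots (a, b * (1.0 + eps), c, y1, y2);

    // x1 shifts by roughly (df/db)*b*eps ~ 1.5e4 * 1e-6 ~ 0.015,
    // about ten times more than eps*x1 ~ 0.0014.
    std::printf ("x1 = %.6f  shift = %+.6f\n", x1, y1 - x1);
    std::printf ("x2 = %.6f  shift = %+.6f\n", x2, y2 - x2);
    return 0;
}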

The only solution here is to reformulate your problem so that it isn't so sensitive to the input values.

Simon Byrne
  • Thank you, @Simon, it looks like you've given the right name to my problem. And if so, that's a great deal, because now I can search for a solution in this area. But the concrete solution at your link is not for my case (only the case where |4ac| is small in comparison with b^2 is considered there, while my `b^2` & `4ac` are of the same order, at least at the point where the problem begins to appear). But thank you nevertheless! – user1234567 Dec 12 '16 at 13:47
  • What does `Dual` mean here? – user1234567 Dec 12 '16 at 16:09
  • Relative sensitivities are not that huge. Now, the only problem appears to be that the input variables are badly scaled. Catastrophic cancellation in the computation of delta does not impact the (relative) accuracy of the solution (to be compared to, say, using the naive formula with 4ac << b^2). – Alexandre C. Dec 13 '16 at 19:34

See my answer to that question: Quadratic equation in Ada

The trick is to always use

x1 = (-b - sign(b) * sqrt(b^2 - 4ac)) / 2a

as the first root, and use

x1 * x2 = c / a

to find the second one. This way, you hedge yourself against the case where 4ac << b^2 and -b + sqrt(delta) exhibits catastrophic cancellation.

If your alleged problem is that b^2 and 4ac have the same magnitude, then delta is actually small compared to b^2, you have no roundoff problem, and you should perhaps rescale your problem (both solutions are very close to -b/2a).
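
A minimal C++ sketch of this formulation (my own illustration, not code from the linked answer; it assumes a != 0 and a non-negative discriminant, and skips error handling):

#include <cmath>

struct Roots { double x1, x2; };

Roots solve_quadratic_stable (double a, double b, double c)
{
    double d = b * b - 4.0 * a * c;              // discriminant (computed naively here)
    double s = (b >= 0.0) ? 1.0 : -1.0;          // sign(b), taking sign(0) as +1
    double q = -(b + s * std::sqrt (d)) / 2.0;   // the half of the formula that never cancels
    return { q / a, c / q };                     // x1 = q / a, x2 from x1 * x2 = c / a
}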

Alexandre C.
  • Yes, my `b^2` & `4ac` are of the same order, so the described solution is not for my case (as I understood it). In short: my problem appears when the actual value of `D` is of the same order as the floating-point representation error of `b^2` & `4ac`. – user1234567 Dec 12 '16 at 12:17
  • @user3241228: No, this is also for your case, since computing stably is something you should always do. Now in your case: b^2 ~ 4ac, and both solutions are `(-b +/- sqrt(delta)) / 2a ~ -b / 2a`. If this is a huge or very small number, then you should rescale your problem, and no amount of numerical trickery will do something for you. – Alexandre C. Dec 12 '16 at 12:21

C++ has a standard math library function fma() that offers an easy way to make the computation of the roots of a quadratic equation as accurate as possible within a given floating-point type, via a robust computation of the discriminant d = b^2 - 4ac:

#include <cmath>   // std::fma, std::sqrt

/*
  Compute a*b - c*d with error < 1.5 ulp

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
  "Further Analysis of Kahan's Algorithm for the Accurate Computation of 2x2 Determinants".
  Mathematics of Computation, Vol. 82, No. 284, Oct. 2013, pp. 2245-2264
*/
template <typename T>
T diff_of_products (T a, T b, T c, T d)
{
    T w = d * c;
    T e = std::fma (-d, c, w);   // error term of w = d*c, captured exactly
    T f = std::fma (a, b, -w);   // a*b - w with a single rounding
    return f + e;
}

/* George E. Forsythe, "How Do You Solve a Quadratic Equation"
   Stanford University Technical Report No. CS40 (June 16, 1966)
*/
template <typename T>
void solve_quadratic (T a, T b, T c, T &x1, T &x2)
{
    T d = diff_of_products (b, b, 2*a, 2*c);   // b*b - 4*a*c, computed robustly
    x1 = 2*c / (-b - std::sqrt (d));
    x2 = 2*c / (-b + std::sqrt (d));
}

The fused multiply-add operation (FMA) implemented by fma() maps to a single hardware instruction on most modern processor architectures. Because FMA computes the full, unrounded, double-width product prior to the addition, it can be used to compute the error of a product accurately.
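
For example, a usage sketch (my addition, not part of the original code) calling the routine above with the values from the question's Update #3; the expected results are the ones quoted in the comments under the question:

#include <cstdio>

int main (void)
{
    double x1, x2;
    solve_quadratic (1.43963872e8, 3.24884062357827e12, 1.83291898112689e16, x1, x2);
    /* expected (per the comments above):
       x1 = -1.1280317962728303e+4, x2 = -1.1286737079930386e+4 */
    std::printf ("x1 = %23.16e\nx2 = %23.16e\n", x1, x2);
    return 0;
}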

As Simon Byrne alluded to in his answer, the specific problem at hand is ill-conditioned, and accurate computation cannot fix this, only a reformulation of the underlying math can.

njuffa
  • I'd give it another 3 - 4 years before I'd say "most" processors have FMA. I still find a lot of Nehalem and Sandy Bridge boxes in use. – Mysticial Dec 12 '16 at 17:54
  • @Mysticial I stated "most *modern* processor *architectures*", not "most processors in existence today". FMA support can be found on x86 (since Haswell), PowerPC, ARM, and GPUs. If necessary, the FMA here can be replaced with a bit of double-T computation. – njuffa Dec 12 '16 at 17:59