
I am currently developing the firmware for a medical device in which a lot of difficult mathematical operations are involved. The target processor supports floating point operations in hardware, but only float32 (aka single).

To simulate the behavior and prove the correctness of my formulae and code, I have ported the relevant mathematical part of the firmware to the GCC toolchain on Linux (gcc 6.3.0, libc6 2.24), double-checking that float32 is used everywhere and that no compiler switch is used which could reduce the precision or standards compliance of mathematical operations; notably, -ffast-math and its friends are not used.

Now, it has turned out that I am getting unexpected results for a small set of input parameters. I have tracked down the problem and have come to the conclusion that libm calculates a wrong result for arctan (to be precise: atan2) for a very small set of input parameters.

For example, if I have

#include <math.h>
#include <stdio.h>

#define C_RAD2DEG (57.29577951308f)

int main(void)
{
  float f_Temp = C_RAD2DEG * atan2f(0.713114202f, 0.665558934f);
  printf("%.9g\n", (double)f_Temp);  /* print the result for inspection */
  return 0;
}

f_Temp is computed to be 46.9755516f, where the right result would be 46.975548972f.

Please note that I am generally aware of the issues with different floating point data types, rounding errors and the like.

However, my feeling is that the error shown above is an order of magnitude too high, even given the low precision of float32, and unfortunately, that error is too much for the calculations which follow.

Furthermore, only a very small subset of possible input parameters to the atan2 function is affected by the problem.

Could anybody please briefly explain whether this is a bug in libm, or whether it is just due to the imprecision of float32 and the large number of sequential operations needed to compute atan2?

Binarus
  • If your system is using [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) for floating point values, then note that the significant number of decimals is about 7. Any decimal after that is really just noise. And yes the problem is most likely due to the rounding problems. – Some programmer dude Feb 18 '19 at 11:06
  • Not sure what "order of magnitude" you are talking about. The error, if any, is in the last significant bit. 46.9755516f is represented as 0x423BE6F7, 46.975548972f is 0x423BE6F6 (both decimal representations have too many decimal digits, about 3-4 more than needed for 32-bit IEEE floats). – n. m. could be an AI Feb 18 '19 at 13:02
  • Except if you have huge speed problems, even if your hardware supports only floats you should be able to use wider operations / computations, it's just slower as it requires multiple operations! And then you could output the integer part separated from the decimal part of the answer, for example... – B. Go Feb 18 '19 at 13:12
  • The error you describe is 56 parts in a billion, so it is surprising you say that error is too much. Could you explain further? – Eric Postpischil Feb 18 '19 at 13:13
  • Just curious, what angle do you compute with such precision? The difference between the two values is 0.0095 arc seconds. Hubble is able to resolve up to about 0.1 arc second. The highest precision angular measurement systems have nominal resolution of about 0.005 arc seconds (the real accuracy is an order of magnitude lower). – n. m. could be an AI Feb 18 '19 at 13:20
  • OK, thanks for all the help. I didn't see in the first place that the two results I mentioned are adjacent. I'll have to think about how I could solve this problem. – Binarus Feb 18 '19 at 13:49
  • To all who are curious: I have a system where a so-called axis transformation is involved. This means that the system is fed with coordinate values and computes corresponding motor positions, and vice versa. When calculating coordinate values from any motor position tuple, then taking those coordinate values and again computing the motor positions, both motor position tuples must be the same. This is where I am running into problems. The geometry of this device is weird, leading to unacceptable results in the process outlined above at very few motor positions due to the accuracy issue. – Binarus Feb 18 '19 at 14:36
  • So you are relying on direct and inverse trig functions in your math library being the *exact* inverses of each other? Good luck but I won't hold my breath waiting for your product to be ready. IMNSHO you need a redesign ASAP. – n. m. could be an AI Feb 18 '19 at 15:24
  • No, not *exact*, just sufficiently precise. Some (relative) angle errors coming out from the forward transformation might be amplified by the backward transformation by factors up to 100 or a little more due to the weird geometry and some singularities, so I'll "just" have to make sure that I get the results from the forward transformation with more precision; I currently estimate that 8 bits more would be sufficient. However, I can't sacrifice too much cycles, so I'll have to stick with float32 as far as possible. This one will take some time ... – Binarus Feb 18 '19 at 15:33

1 Answer


The number you report as the observed result, 46.9755516f, corresponds to the float value 46.975551605224609375.

The number you report as the expected result, 46.975548972f, corresponds to the float value 46.97554779052734375.

These are adjacent float values, meaning they differ by 1 unit of least precision (ULP). (Their difference is 3.814697265625e-06, which is the value of the least significant bit in the float significand when the most significant bit has value 32, as it does for numbers around 47.) This is the smallest possible amount by which a float can change at that scale.
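
For reference, here is a minimal sketch (plain C; the helper name float_bits and the variable names are mine, just for illustration) that checks that the two decimal values quoted above really map to adjacent floats; the bit patterns it prints should match the 0x423BE6F6 / 0x423BE6F7 mentioned in the comments. Link with -lm.

#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bit pattern as a 32-bit integer. */
static uint32_t float_bits(float f)
{
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}

int main(void)
{
  float observed = 46.9755516f;    /* value the program computed          */
  float expected = 46.975548972f;  /* value reported as the right result  */

  printf("observed = %.17g (0x%08X)\n", (double)observed, (unsigned)float_bits(observed));
  printf("expected = %.17g (0x%08X)\n", (double)expected, (unsigned)float_bits(expected));

  /* nextafterf() steps exactly one ULP towards its second argument. */
  printf("one ULP apart: %s\n",
         nextafterf(expected, INFINITY) == observed ? "yes" : "no");
  return 0;
}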

Generally, the math library routines are difficult to implement, and nobody has implemented all of them with correct rounding (rounding to the representable number that is nearest the exact mathematical value) and known bounded run time. A few ULP of error is not unusual in the trigonometric routines.

Even if the libc code you used provided a correctly rounded result, converting it from radians to degrees introduces two more rounding errors (converting 180/π to a representable value and multiplying by it). It is not reasonable to expect the final result to be the float that is nearest the ideal mathematical result; you should expect several ULP of error.
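
To illustrate, here is a small sketch along the lines of the comment suggesting wider operations (the variable names are mine, and this is just one possible workaround, not part of the question): it compares the all-float32 pipeline from the question with a variant that performs atan2 and the degree conversion in double and rounds to float only once at the end, so the only remaining float rounding error is that final conversion.

#include <math.h>   /* M_PI is provided by glibc's math.h (not strict ISO C) */
#include <stdio.h>

int main(void)
{
  float y = 0.713114202f;
  float x = 0.665558934f;

  /* All-float32 pipeline, as in the question. */
  float f_single = 57.29577951308f * atan2f(y, x);

  /* Wider intermediate: compute in double, round to float once at the end. */
  float f_double = (float)(atan2((double)y, (double)x) * (180.0 / M_PI));

  printf("float32 pipeline: %.9g\n", (double)f_single);
  printf("double  pipeline: %.9g\n", (double)f_double);
  return 0;
}

Whether the double variant is fast enough on the target depends on how costly software-emulated double arithmetic is there, as already noted in the comments.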

Eric Postpischil