
I'm trying to understand floating-point arithmetic in GNURadio and started looking into their tests. The test generates random float input and random taps, then passes everything to the filter. Later it compares the expected output and the actual output using some margin.

There is a cryptic comment about that margin:

    // we use a sloppy error margin because on the x86 architecture,
    // our reference implementation is using 80 bit floating point
    // arithmetic, while the SSE version is using 32 bit float point
    // arithmetic.

I cannot find 80-bit arithmetic anywhere in the source code. Can anyone help me, or just explain why the error margin depends on the taps size?

dernasherbrezon
  • I believe the internal precision of the x87 floating-point architecture is 80-bit - see https://en.wikipedia.org/wiki/Extended_precision#IEEE_754_extended_precision_formats. – Oliver Charlesworth Jan 14 '18 at 22:26
  • @OliverCharlesworth: And some (Win32) compilers allow you to use extended (80 bit) precision too, as `long double`. – Rudy Velthuis Jan 14 '18 at 23:03

2 Answers


On x86 with the x87 FPU, even simply using double you get 80-bit precision for intermediate results, because the FPU register stack holds 80-bit floating-point numbers internally.

If your code expects and depends on the rounding of 64-bit or 32-bit float math, you can get surprises.

For example, I've been hit quite a few times by something like x < y being true, but after assigning z = x you may get z >= y (all variables declared double). This can happen when x ends up allocated in an 80-bit FPU register while z is a real 64-bit floating-point variable in memory.

g++ has a specific option to avoid these issues (-ffloat-store) that prevents the use of the extra bits (but it slows down math-heavy code quite a bit).

6502

Note: Per comment from Marcus Müller below, GNURadio has not used 80-bit floating-point for some time, so the comment cited in the question is stale at best. So this answer only speaks to differences between 80-bit and 32-bit floating-point arithmetic in general and is not applicable to the current GNURadio implementation.

You have not shown source code, so this answer is speculative. Presumably, the reference implementation uses long double, and the C++ implementation the author used maps long double to the 80-bit extended-precision format built into Intel hardware. It would not look like 80-bit code in the sense of any explicit extended-precision arithmetic; it would just look like C++ code using the long double type.

Meanwhile, the primary implementation uses SSE code. SSE is the name of an Intel instruction subset that includes SIMD (Single Instruction Multiple Data) instructions operating on four 32-bit floating-point numbers at a time. That code would be obvious on inspection: it would be assembly using SSE instructions, C++ using compiler built-ins (intrinsics) that map to SSE instructions, or possibly C++ using compiler extensions for vector types.

Naturally, the floating-point rounding errors that occur when using 32-bit floating-point arithmetic differ from the floating-point rounding errors that occur when using 80-bit floating-point arithmetic, so one would not expect results computed by the two different methods to be equal. The author has designed the tests to tolerate some differences in results.

It should be noted that although the code is of course targeted at the Intel architecture, not all C++ implementations implement long double with 80-bit floating-point arithmetic; an implementation might use 64-bit arithmetic instead. So, if you run the code using a C++ implementation different from the author's, you may get different results. While 64-bit arithmetic generally has larger rounding errors than 80-bit arithmetic, whatever tolerance the author set for differences between 32-bit and 80-bit arithmetic will likely also cover the differences between 32-bit and 64-bit arithmetic, so no adjustment may be needed.

Eric Postpischil
  • Sorry to disappoint, but the reference there uses 32 bit floats in a bog-normal C++ for loop, too. I don't know why Trondeau wrote that (and I doubt he remembers at this point), but as far as I know our SSE, SSE2, SSE4.1/2 and AVX accelerated code, the dotproducts (which is what is being tested here) are, for "normal" floating point values (ie, not NaN, not +- Inf, not -0) pretty much identical. – Marcus Müller Jan 19 '18 at 20:29
  • Note that since 2012, we moved forward: the accelerated (or naive) implementation of the math algorithms is no longer part of the FIR filter code itself, but will be linked in / compiled in from [VOLK](http://libvolk.org), the Vector Optimized Library of Kernels, which does their own tests (with which I'm not always 100% happy). – Marcus Müller Jan 19 '18 at 20:32