
I've noticed an interesting phenomenon where flags passed to the compiler and linker affect the running code in ways I cannot understand.

I have a library that presents different implementations of the same algorithm in order to test the run speed of those different implementations.

Initially, I tested the situation with a pair of identical implementations to check that the correct thing happened (both ran at roughly the same speed). I began by compiling the objects (one per implementation) with the following compiler flags:

-g -funroll-loops -flto -Ofast -Werror

and then during linking passed gcc the following flags:

-Ofast -flto=4 -fuse-linker-plugin

This gave a library that ran blazingly fast, but curiously it was reliably and repeatably ~7% faster for whichever object was included first in the arguments during linking (i.e. either implementation was faster if it was linked first).

So with:

gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj1.os obj2.os -lm

vs

gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj2.os obj1.os -lm

In the first case the implementation in obj1 ran faster than the implementation in obj2; in the second case, the converse was true. To be clear, the code is identical in both cases except for the function entry name.

Now I removed this strange link-argument-order difference (and actually sped it up a bit) by removing the -Ofast flag during linking.

I can replicate mostly the same situation by changing -Ofast to -O3 -ffast-math, but in that case I need to supply -ffast-math during linking, which again leads to the strange ordering speed difference. I'm not sure why the speed-up is maintained for -Ofast but not for -O3 -ffast-math when -ffast-math is not passed during linking, but I can accept it might be down to link-time optimisation passing the relevant information through in one case and not the other. That doesn't explain the speed disparity, though.

Removing -ffast-math means it runs ~8 times slower.
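
For illustration only (this is not my actual algorithm), the kind of loop where -ffast-math makes that sort of difference is a plain floating-point reduction: under strict IEEE semantics the compiler may not reassociate the additions, so it cannot vectorise the loop, whereas -ffast-math allows it to.

#include <stddef.h>

/* Hypothetical example, not the code from my library: a straight
 * reduction. Without -ffast-math (or at least -fassociative-math)
 * the additions must be performed strictly in order, which blocks
 * vectorisation; with it, the compiler is free to reorder and use
 * SIMD. */
double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}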

Is anybody able to shed some light on what might be causing this effect? I'm really keen to understand this behaviour so that I don't accidentally trigger it down the line.

The run-speed test is performed in Python using a wrapper around the library and timeit, and I'm fairly sure this is doing the right thing (I can twiddle orders and so on to show that the Python-side effects are negligible).
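
In case it's useful, this is the sort of stand-alone timing cross-check I mean. It's only a sketch: the entry-point names and signatures (impl1/impl2 taking a buffer and a length) are placeholders, not my library's real API.

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Placeholder signature; the real entry points differ. */
typedef void (*impl_fn)(double *data, size_t n);

static double time_impl(impl_fn fn, double *data, size_t n, int reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        fn(data, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    /* Build with e.g.: gcc -O2 harness.c -ldl */
    void *lib = dlopen("./libfoo.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    impl_fn impl1 = (impl_fn)dlsym(lib, "impl1");   /* placeholder names */
    impl_fn impl2 = (impl_fn)dlsym(lib, "impl2");
    if (!impl1 || !impl2) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    size_t n = 1 << 20;
    double *data = malloc(n * sizeof *data);
    for (size_t i = 0; i < n; i++)
        data[i] = (double)i;

    printf("impl1: %.4f s\n", time_impl(impl1, data, n, 100));
    printf("impl2: %.4f s\n", time_impl(impl2, data, n, 100));

    free(data);
    dlclose(lib);
    return 0;
}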

I also tested the library for correctness of output, so I can be reasonably confident of that too.

  • Could you make an [mcve] or show us the code of your library? It's hard to find the source of a weird behaviour without having a chance to see the code that has it. – fuz Dec 18 '15 at 15:35
  • Which version of GCC do you observe this with? – John Bollinger Dec 18 '15 at 16:01
  • Note that although `-Ofast` does turn on all optimizations covered by `-O3` and `-ffast-math`, the documentation does not foreclose the possibility that it also enables other optimizations that are not covered by those -- perhaps even some that are not available any other way. To the extent that your question is "why is behavior different for `-Ofast` than for `-O3 -ffast-math`", that probably has something to do with it. – John Bollinger Dec 18 '15 at 16:08
  • The GNU compiler collection does not include a linker, but uses the system linker (typically GNU `ld` from `binutils`). – too honest for this site Dec 18 '15 at 16:08
  • Have a look at the generated assembler code. It may well be due to how the functions are inlined by LTO, how memory is organised (e.g. cache alignment), register allocation, etc. A complete analysis is IMO off-topic here, as it likely requires a code review. – too honest for this site Dec 18 '15 at 16:13
  • Bear with me, I'm working on a minimal complete example. The problem is the level of integration into support code that makes it a little hard to tease apart. – Henry Gomersall Dec 18 '15 at 16:23
  • Ok, so this is very hard to pin down: changing the code changes the manifestation of the problem, which makes a minimal complete example a difficult task. I've moved all the initialisation memory allocations to give aligned data and also set `-mtune=native`, which seems to largely eliminate any difference (though ~2% or so still remains). Does `-mtune` set the stack alignment? What parameters could lead to identical bits of C not running at the same speed? (It's worth noting the assembly is not quite identical in both cases, though it's close). – Henry Gomersall Dec 18 '15 at 19:51

1 Answer


Too long for a comment, so posted as an answer:

Using -ffast-math and/or -Ofast can lead to incorrect results in mathematical operations, so I would suggest not using them. This is spelled out in these excerpts from the gcc manual:

Option -ffast-math sets the options:

  1. -fno-math-errno,
  2. -funsafe-math-optimizations,
  3. -ffinite-math-only,
  4. -fno-rounding-math,
  5. -fno-signaling-nans and
  6. -fcx-limited-range.

This option causes the preprocessor macro __FAST_MATH__ to be defined.

This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.

Option -Ofast:

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
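
As a small, self-contained illustration (not the OP's code) of how -ffast-math changes meaning rather than just speed: -ffinite-math-only lets the compiler assume NaNs and infinities never occur, so NaN checks like the ones below may be folded away when the program is built with -ffast-math or -Ofast.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Compare e.g.: gcc -O2 nan.c  vs  gcc -Ofast nan.c */
#ifdef __FAST_MATH__
    puts("built with -ffast-math (or -Ofast)");
#endif
    /* Produce a NaN at run time (strtod keeps the value away from the
     * constant folder). */
    double zero = strtod("0", NULL);
    double x = zero / zero;

    /* With -ffast-math both of these checks may report 0 (false). */
    printf("x != x   -> %d\n", x != x);
    printf("isnan(x) -> %d\n", isnan(x) != 0);
    return 0;
}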

  • Well, all issues can reasonably be accepted if one understands them - many of the corner cases of IEEE floating point can be safely ignored in many algorithms, but that still requires `-funsafe-math-optimizations`. – Henry Gomersall Dec 21 '15 at 09:58
  • "unsafe" means not strict standards compliance. – Henry Gomersall Dec 21 '15 at 09:59