Is it possible to enable link-time optimization while only disabling strict aliasing for some functions?

Question

My program conforms to the strict aliasing rule, except for one place: a compilation unit which contains hashing functions such as MurmurHash3, SpookyHash, etc. On x86 and x86_64, these hashing functions accept a const char *, cast it to a uint32 pointer, and process the data in blocks of 4 bytes. This makes them much faster than processing the data byte-by-byte, but I believe it breaks the strict aliasing rule. Right now, I compile this compilation unit with -fno-strict-aliasing, while I compile the rest of the program with -fstrict-aliasing.

But I'm wondering what happens if I enable link-time optimization. As far as I know, GCC and Clang implement link-time optimization sort-of by storing the program source into .o files, so that at the linking phase the compiler knows the source code of the entire program. Is it then still possible to disable strict aliasing for the hashing functions only? Or do I now have to disable strict aliasing for the entire program? Or did I misunderstand things completely, and MurmurHash3/SpookyHash are in fact strict aliasing compliant?

Why can you not simply use `memcpy` to copy the bytes into a local variable of type `uint32`, and then operate on that local variable? This should be optimised to the equivalent of your code, but without violating any aliasing rule. — hvd, Sep 10 '14 at 11:54
I wasn't sure whether the compiler would be able to optimize this. If it does then that would be great. But seeing Christoph's answer, it would seem that the compiler is unable to optimize it. — Hongli, Sep 10 '14 at 13:53
Christoph's answer contains a link to the code, and that code doesn't do what I described. When you want to treat buffers of varying sizes as series of `uint64_t` values, that code copies the whole buffer into an array of `uint64_t` (as part of a `union`), whereas what I'm describing is having one single `uint64_t` variable and repeatedly copying the specific 8 bytes that you want to read into that. — hvd, Sep 10 '14 at 14:03
@Hongli: I have not tested the code with link-time optimizations (or poor-man's variant thereof, ie including the `.c` file), so I don't know if the compiler can do some more magic if it actually gets to see both function body and arguments... — Christoph, Sep 10 '14 at 15:47
@Christoph That's the maximum block size, but for data that is not an exact multiple of the block size, you will have one smaller block. But I shouldn't have pointed out the varying sizes, because it's not relevant to the point I'm making. Even with a fixed block size, the code doesn't do what I'm describing. The copy into `buf` is a big hit on performance, and unnecessary. You don't need to `memcpy` more than one integer value at a time. — hvd, Sep 10 '14 at 15:55
@hvd: Lazy person that I am, I did it the same way as Mr Jenkins, ie copying the whole block. I don't know if copying qwords one-at-a-time will be faster: Registers probably will be spilled anyway, but the compiler will have an easier time optimizing away the single variables and we might not gain anything from batching the memory access. You might be right, but I simply don't know without benchmarking... — Christoph, Sep 10 '14 at 16:10
@Christoph I do know, because I did measure. :) There's no need for register spills. GCC is capable of translating accesses to local variables that have been copied from a buffer, into memory accesses directly into that buffer. The local variable in the code does not necessarily translate to a register of the CPU. — hvd, Sep 10 '14 at 16:16
@hvd: thanks for the information - I added a [note-to-self](https://bitbucket.org/cggaertner/spooky/issue/1/dont-memcpy-whole-blocks) - I'll get to it $whenever ;) — Christoph, Sep 10 '14 at 16:45
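
The one-word-at-a-time `memcpy` approach described in the comments above might look roughly like the following. This is only a minimal sketch: `load_u32` and `xor_blocks` are hypothetical names, and the XOR fold is a stand-in for the block loop of a real hash such as MurmurHash3, not the actual algorithm. Whether the compiler turns the `memcpy` into a plain load depends on the compiler and optimization level, so it is worth checking the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one 32-bit word from an arbitrary byte position.
 * memcpy may copy the bytes of any object, so this needs no pointer
 * cast and does not depend on the alignment of p. */
static uint32_t load_u32(const char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical toy "hash": XOR-folds all complete 4-byte blocks.
 * Trailing bytes (len % 4) are ignored in this sketch. */
static uint32_t xor_blocks(const char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t acc = 0;
    for (size_t i = 0; i + 4 <= len; i += 4)
        acc ^= load_u32(data + i);
    return acc;
}
```

Only a single `uint32_t` is copied per iteration, rather than a whole block into a separate buffer.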

Christoph · Answer 1 · 2014-09-10T12:23:35.620

There are three things to take into account:

performance
portability
standard compliance

You will get the best performance if you avoid copying data and can guarantee aligned access.

Unaligned access and aliasing are portability concerns. If you do decide to copy the data, this will take care of both. Otherwise, you have to adjust the algorithm to handle mis-aligned input data and guarantee that there is no competing access through pointers of incompatible type:

If you only access data through a single pointer type, violating effective typing rules will make your program non-conformant, but probably won't be a problem in practice, even if you do not pass -fno-strict-aliasing - which is where having unit tests comes in quite handy.

For SpookyHash, I actually have my own C99 version (which also fixes an off-by-one in V2 of the reference implementation). If you're fine with violating effective typing and your architecture supports unaligned access (or all input data is aligned), you may pass the -DSPOOKY_NOCOPY compiler flag. On my x86-64 machine, the performance gain was about 10-20% depending on input size.
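
To illustrate the trade-off, a compile-time switch between copying and direct access could be sketched as below. `HASH_NOCOPY`, `read_u64` and `sum_blocks` are hypothetical names made up for this example, loosely modelled on the idea behind -DSPOOKY_NOCOPY; this is not the code of the C99 version mentioned above, and the summing loop is a placeholder for real hash mixing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HASH_NOCOPY is a hypothetical macro for this sketch only. */
static uint64_t read_u64(const unsigned char *p)
{
#ifdef HASH_NOCOPY
    /* Direct access: violates effective typing unless the bytes really
     * are a uint64_t object, and assumes the input is aligned or the
     * architecture tolerates unaligned loads. */
    return *(const uint64_t *)(const void *)p;
#else
    /* Portable default: copy one word at a time. */
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
#endif
}

/* Placeholder block loop: sums all complete 8-byte blocks. */
static uint64_t sum_blocks(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    const unsigned char *p = data;
    uint64_t acc = 0;
    for (; len >= sizeof(uint64_t); p += sizeof(uint64_t), len -= sizeof(uint64_t))
        acc += read_u64(p);
    return acc;
}
```

Building with and without `-DHASH_NOCOPY` and benchmarking both is the only reliable way to tell whether the direct path gains anything on a given compiler.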

score 6 · Answer 2 · answered Sep 10 '14 at 12:20

Right now, I compile this compilation unit with -fno-strict-aliasing, while I compile the rest of the program with -fstrict-aliasing.

You can do the same with link-time optimization: just compile that one translation unit without -flto.

Example with clang (same with gcc):

 clang -flto -O3 -c a.c
 clang -O3 -fno-strict-aliasing -c b.c  # no -flto and with -fno-strict-aliasing
 clang -flto -O3 -c main.c
 clang a.o b.o main.o -o main

Basile Starynkevitch · Answer 3 · 2014-09-10T12:14:55.020

It should be possible (at least you could try) since recent GCC provides function specific option pragmas. You could try adding something like

 #pragma GCC optimize ("-fno-strict-aliasing")

before your aliasing functions, and put

 #pragma GCC reset_options

after them.

Perhaps you need to wrap these with

#if __GNUC__ >= 4

and of course some #endif
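
Put together, the guarded pragmas might look like the sketch below. It is untested against LTO, so treat it as an assumption-laden illustration: `HAVE_PRAGMA_GCC_OPTIMIZE` and `sum_words` are hypothetical names, the summing loop stands in for a real hash, and the check excludes Clang explicitly because Clang also defines `__GNUC__`. Compilers without the pragma get a `memcpy` fallback here, so the cast is only compiled where the option can be applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature macro: true only for genuine GCC. */
#if defined(__GNUC__) && !defined(__clang__)
#  define HAVE_PRAGMA_GCC_OPTIMIZE 1
#else
#  define HAVE_PRAGMA_GCC_OPTIMIZE 0
#endif

#if HAVE_PRAGMA_GCC_OPTIMIZE
#pragma GCC optimize ("no-strict-aliasing")
#endif

/* Sums nwords 32-bit words starting at data. */
static uint32_t sum_words(const char *data, size_t nwords)
{
    assert(data != NULL || nwords == 0);
    uint32_t acc = 0;
#if HAVE_PRAGMA_GCC_OPTIMIZE
    /* Type-punned access; assumes data is suitably aligned. */
    const uint32_t *w = (const uint32_t *)(const void *)data;
    for (size_t i = 0; i < nwords; i++)
        acc += w[i];
#else
    /* Fallback without the pragma: copy each word instead. */
    for (size_t i = 0; i < nwords; i++) {
        uint32_t v;
        memcpy(&v, data + i * sizeof v, sizeof v);
        acc += v;
    }
#endif
    return acc;
}

#if HAVE_PRAGMA_GCC_OPTIMIZE
#pragma GCC reset_options
#endif
```

Whether the per-function option survives inlining into callers compiled with -fstrict-aliasing is something to verify on the GCC version in use.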

Alternatively, use build-system tricks (e.g. autoconf, cmake, or a sophisticated GNU make 4.0 rule) to define your own HAVE_GENUINE_GCC as 1 only for the genuine GCC compiler, your own HAVE_GENUINE_CLANG as 1 for the genuine Clang/LLVM compiler, and so on. Or detect whether the above pragmas are understood by compiling some sample code, and then define HAVE_WORKING_PRAGMA_GCC_OPTIMIZE as 1.

BTW, on GCC at least, -flto does not store a representation of the program source in the object files, only a digested form of some GCC internal representations (like Gimple, etc...) obtained when compiling your source code. This is quite different!

PS. I did not try, so perhaps it is not that simple.

Unfortunately Clang doesn't support this. My software is used by users on various operating systems. Some of them use Clang (e.g. OS X and FreeBSD). — Hongli, Sep 10 '14 at 11:54
You could play build tricks (e.g. with `autoconf` or `cmake` ...) to define your own `HAVE_GNU_GCC` macro as 1 for genuine GCC.... — Basile Starynkevitch, Sep 10 '14 at 12:00
