So you have code that enables -mpopcnt and uses __builtin_popcount if that will be fast. Otherwise you use something different, because your custom solution beats gcc's implementation?
Keep in mind that host != target in some cases. Build-time CPU detection is not appropriate for binaries that have to run on other machines, e.g. when Linux distros build binary packages. Cross-compiling is also a thing, and is commonly done when targeting an embedded system or an old slow system.
Maybe write a custom C program that returns the result you want.
On x86, you could just use the result of runtime CPU detection: run the CPUID instruction and check if popcnt is supported. It's probably best not to unconditionally run the popcnt instruction, since processes that run an illegal instruction generate a syslog entry on some modern distros (e.g. Ubuntu).
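As a rough sketch of the manual CPUID route, assuming GCC or clang and their cpuid.h helper header: CPUID leaf 1 reports popcnt support in bit 23 of ECX. The function name cpu_has_popcnt_cpuid and the POPCNT_ECX_BIT macro are made up for this example, and the non-x86 branch simply reports "no".

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/clang helper header wrapping the CPUID instruction */
#endif

/* CPUID leaf 1 reports POPCNT support in bit 23 of ECX. */
#define POPCNT_ECX_BIT (1u << 23)

/* Hypothetical helper: true if CPUID says the popcnt instruction exists. */
static bool cpu_has_popcnt_cpuid(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                   /* leaf 1 not available */
    return (ecx & POPCNT_ECX_BIT) != 0;
#else
    return false;                       /* not x86: this check does not apply */
#endif
}
```

Nothing here executes popcnt itself, so it can't fault on a CPU that lacks the instruction.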
With recent GNU C extensions, the easiest way to do that is: __builtin_cpu_init() and __builtin_cpu_supports("popcnt"), saving you the trouble of manually decoding the CPUID results.
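A minimal sketch of that check, assuming an x86 target and a compiler that provides these builtins (GCC and clang do). The wrapper name cpu_has_popcnt is hypothetical, and the fallback branch for other targets just returns false.

```c
#include <stdbool.h>

/* Hypothetical wrapper around the GNU C CPU-feature builtins. */
static bool cpu_has_popcnt(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                        /* populate the CPU feature data */
    return __builtin_cpu_supports("popcnt") != 0;
#else
    return false;                                /* builtins not available here */
#endif
}
```

A tiny probe program whose main() returns 0 or 1 based on this function would give a build script an exit status to test, but that only describes the machine the probe runs on, not necessarily the target.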
You could then fall back to a micro-benchmark of __builtin_popcount against your custom macro, and take whichever is faster. That might be useful even on non-x86 architectures where your macros beat gcc's implementation. (e.g. an architecture that always has a popcnt instruction available). Then you'd have to handle the case where you should use __builtin_popcount but not build with -mpopcnt.
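A crude sketch of what such a micro-benchmark could look like. Everything here is illustrative: popcount_swar is a generic bit-twiddling popcount standing in for "your custom macro", and the names popcount_fn, time_popcount and pick_popcount are invented for the example. Timing with clock() over a simple loop is noisy, and calling through a function pointer may not reflect how the code performs when inlined, so treat the result as a hint at best.

```c
#include <stdint.h>
#include <time.h>

typedef unsigned (*popcount_fn)(uint32_t);

/* Stand-in for a custom implementation: classic SWAR bit count. */
static unsigned popcount_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* Whatever the compiler emits for the builtin with the current flags. */
static unsigned popcount_builtin(uint32_t x)
{
    return (unsigned)__builtin_popcount(x);
}

/* Time `iterations` calls of fn; returns CPU seconds as measured by clock(). */
static double time_popcount(popcount_fn fn, uint32_t iterations)
{
    volatile unsigned sink = 0;      /* keeps the loop from being optimized away */
    clock_t start = clock();
    for (uint32_t i = 0; i < iterations; i++)
        sink += fn(i);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Pick whichever candidate ran faster; ties go to the builtin. */
static popcount_fn pick_popcount(void)
{
    const uint32_t n = 1u << 22;
    double t_builtin = time_popcount(popcount_builtin, n);
    double t_swar = time_popcount(popcount_swar, n);
    return (t_swar < t_builtin) ? popcount_swar : popcount_builtin;
}
```

Run at build time, this has the same host != target problem as any other build-time detection. Run once at program startup, it picks a function pointer for the machine the binary is actually on.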