For a particular project I'm stuck with gcc and a 32-bit Ubuntu 12.04 LTS running on a Core i7 that supports SIMD instructions up to AVX.
Because the OS is 32-bit, I apparently can't use the 256-bit AVX instructions. I do have access to the 128-bit SSE4.2 instructions, and POPCNT can operate on 16-, 32- and 64-bit data, so it looked promising. But I’ve tried several ways of feeding 64-bit data to POPCNT without success. GCC 4.6.3 returns
- “unknown register name” for r8 to r15,
- “bad register name” for rax-rdx,
- when trying to provide mm registers, or to pass my inline assembly some uint64 or long long values that get assigned to registers, like this:
    uint64 a, b;
    __asm__ volatile ("POPCNT %1, %0;"
                      : "=r"(b)
                      : "r"(a)
                      :
                     );
gcc reports “operand type mismatch for popcnt”,
- and writing POPCNTQ leads to “invalid instruction suffix for popcnt”.
It would have been so nice if POPCNT supported 128-bit xmm registers...
Is there any workaround to apply POPCNT to 64-bit data in assembly?
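For reference, the fallback I would expect to work is to split the 64-bit value into two 32-bit halves and sum the two counts, since the 64-bit registers (rax, r8 to r15) and the 64-bit operand form of POPCNT are not encodable in 32-bit mode. The following is only a minimal sketch under that assumption: the names `popcnt32` and `popcnt64_split` are hypothetical, the inline assembly path is used only when the compiler defines `__POPCNT__` (for example with -mpopcnt or -msse4.2), and `__builtin_popcount` is used otherwise.

```c
#include <stdint.h>

/* Count the set bits of one 32-bit word.
 * Hypothetical helper, not taken from any library. */
static inline uint32_t popcnt32(uint32_t x)
{
#if defined(__POPCNT__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t n;
    /* 32-bit operand form of POPCNT: encodable in 32-bit mode.
     * AT&T syntax: source first, destination second. */
    __asm__ ("popcnt %1, %0" : "=r"(n) : "r"(x) : "cc");
    return n;
#else
    /* Portable fallback when POPCNT was not enabled at compile time. */
    return (uint32_t)__builtin_popcount(x);
#endif
}

/* Population count of a 64-bit value as the sum of its two halves.
 * Hypothetical name; each half fits in a 32-bit register. */
static inline uint32_t popcnt64_split(uint64_t x)
{
    uint32_t lo = (uint32_t)x;          /* bits 0..31  */
    uint32_t hi = (uint32_t)(x >> 32);  /* bits 32..63 */
    return popcnt32(lo) + popcnt32(hi);
}
```

Calling `popcnt64_split(value)` on a `uint64_t` costs two POPCNT instructions plus one add per 64-bit word on a 32-bit target. Whether that beats a shuffle-based count would have to be measured.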
PS: the discussion comparing SSSE3 popcount using shuffle with SSE4 POPCNT performance reached its conclusion here http://danluu.com/assembly-intrinsics/ and the difference was due only to the fact that intrinsics don't always produce efficient assembly code. Intrinsics are nice for quickly optimizing C/C++ code, and if that's enough to meet the requirements, fine. Otherwise, I obtained a nearly 30% performance improvement by coding popcount using shuffle in assembly compared to the intrinsics version.
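To make the "popcount using shuffle" idea concrete, here is a minimal sketch of the technique: each byte is split into two nibbles, each nibble is looked up in a 16-entry table of bit counts (what PSHUFB does for 16 bytes at once), and the per-byte counts are summed. This is not the hand-written assembly referred to above. It uses intrinsics purely for readability, the names `NIBBLE_BITS` and `popcnt128_lut` are hypothetical, and the SSSE3 path is compiled only when `__SSSE3__` is defined (for example with -mssse3), with a scalar loop doing the same lookups otherwise.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* _mm_shuffle_epi8 (SSSE3) */
#endif

/* Number of set bits in each 4-bit value 0..15. */
static const uint8_t NIBBLE_BITS[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Count the set bits in a 16-byte (128-bit) buffer. Hypothetical helper. */
static uint32_t popcnt128_lut(const uint8_t bytes[16])
{
#if defined(__SSSE3__)
    const __m128i lut  = _mm_loadu_si128((const __m128i *)NIBBLE_BITS);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i v  = _mm_loadu_si128((const __m128i *)bytes);
    /* Low and high nibble of every byte, each in the range 0..15. */
    __m128i lo = _mm_and_si128(v, mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    /* PSHUFB as a 16-way table lookup, then per-byte sum (max 8). */
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                               _mm_shuffle_epi8(lut, hi));
    /* PSADBW against zero adds the 8 bytes of each 64-bit half. */
    __m128i sums = _mm_sad_epu8(cnt, _mm_setzero_si128());
    return (uint32_t)_mm_cvtsi128_si32(sums) +
           (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
#else
    /* Scalar model of the same nibble lookups. */
    uint32_t total = 0;
    for (int i = 0; i < 16; i++)
        total += NIBBLE_BITS[bytes[i] & 0x0F] + NIBBLE_BITS[bytes[i] >> 4];
    return total;
#endif
}
```

The sketch handles a single 128-bit block. A real loop over a large buffer would accumulate several `cnt` vectors before the PSADBW step, and that is the kind of scheduling detail where hand-written assembly and compiler output can differ.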