0

For a particular project I'm stuck with gcc and a 32-bit Ubuntu 12.04 LTS running on a Core i7 that supports SIMD instructions up to AVX.

Because the OS is 32-bit, I apparently can't use the 256-bit AVX instructions. I do have access to the 128-bit SSE4.2 instructions, and POPCNT can operate on 16-, 32- and 64-bit data, so it looked promising. But I’ve tried several ways to feed 64-bit data to POPCNT without success. GCC 4.6.3 returns

  • “unknown register name” for r8 to r15,
  • “bad register name” for rax-rdx,
  • when trying to use mm registers, or to pass my inline assembly function some uint64 or long long variables that are assigned to registers in this way:

uint64 a, b;
__asm__ volatile (“POPCNT %1, %0;”
            :”=r”(b)
            :”r”(a)
            :
        )

gcc reports “operand type mismatch for popcnt”,

  • and writing POPCNTQ leads to “invalid instruction suffix for popcnt”.

It would have been so nice if POPCNT supported the 128-bit xmm registers...

Is there any workaround to apply POPCNT to 64-bit data in assembly?

PS: the discussion about SSSE3 shuffle-based popcount versus SSE4 POPCNT performance reached its conclusion here: http://danluu.com/assembly-intrinsics/. The difference was due only to the fact that intrinsics don't always produce efficient assembly code. Intrinsics are a nice way to optimize C/C++ code quickly, and if that is enough to meet the requirements, fine. Otherwise, I obtained a nearly 30% performance improvement by coding the shuffle-based popcount in assembly compared to the intrinsics version.
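For readers unfamiliar with the shuffle-based popcount mentioned in the PS, here is a minimal sketch of the general technique, not the asker's actual assembly. It assumes GCC or Clang; the names `NIBBLE_BITS` and `popcount128_lut` are hypothetical. Each byte is split into two 4-bit nibbles, a 16-entry table gives the bit count of each nibble, and PSHUFB (`_mm_shuffle_epi8`) performs 16 table lookups at once. When SSSE3 is not enabled at compile time, the same table is applied with a scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSSE3__)
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#endif

/* Number of set bits in every 4-bit value 0..15. */
static const uint8_t NIBBLE_BITS[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

/* Hypothetical helper: count the set bits in one 16-byte block. */
static unsigned popcount128_lut(const uint8_t *p)
{
#if defined(__SSSE3__)
    const __m128i lut  = _mm_loadu_si128((const __m128i *)NIBBLE_BITS);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i lo = _mm_and_si128(v, mask);                     /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);  /* high nibbles */
    /* Two table lookups per byte, then add: per-byte bit counts (0..8). */
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                               _mm_shuffle_epi8(lut, hi));
    /* PSADBW against zero sums each group of 8 bytes into a 64-bit lane. */
    __m128i sums = _mm_sad_epu8(cnt, _mm_setzero_si128());
    /* Add the two lane sums; this avoids 64-bit extracts, which are
       not available in 32-bit code. */
    return (unsigned)_mm_cvtsi128_si32(sums)
         + (unsigned)_mm_extract_epi16(sums, 4);
#else
    /* Scalar fallback using the same nibble table. */
    unsigned total = 0;
    for (size_t i = 0; i < 16; ++i)
        total += NIBBLE_BITS[p[i] & 0x0F] + NIBBLE_BITS[p[i] >> 4];
    return total;
#endif
}
```

Building with `-mssse3` selects the vector path. Whether it beats POPCNT depends on the surrounding code, as the discussion on this page shows.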

chus
user3581220
  • I might be mistaken, but what exactly would RAX and R15 mean in 32 bits mode? They're the names of GP 64 bit registers, which you by definition do not have in 32 bits mode. The wide registers are vector "XMM" registers. And while `"r"(a)` appears to be an innocent syntax, it does require that `a` fits in a GP register. – MSalters Jan 26 '15 at 14:37
  • and note that you're using smartquotes which are invalid characters in C and C++, so that won't even compile – phuclv Jan 27 '15 at 04:10

4 Answers

2

popcnt is an integer instruction. As such, in 32 bit mode you can't use it with 64 bit operands. You will need to compute the popcnt for the two halves and add them together. This is what all clang versions I have tested do for the builtin. However, I couldn't get any gcc version to use the popcnt instruction. So while generally the builtin is recommended, in this case inline asm might be better.
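A minimal sketch of the "two halves" approach, assuming GCC or Clang. The helper names `popcnt32` and `popcnt64_split` are hypothetical. The inline asm path is only compiled when the compiler has been told POPCNT is available (for example with `-mpopcnt` or `-msse4.2`, which define `__POPCNT__`); otherwise the sketch falls back to `__builtin_popcount`.

```c
#include <stdint.h>

/* Hypothetical helper: population count of one 32-bit word. */
static inline uint32_t popcnt32(uint32_t x)
{
#if defined(__POPCNT__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t n;
    /* AT&T syntax: source first, destination second. Both operands
       are 32-bit registers, so no size suffix is needed. */
    __asm__("popcnt %1, %0" : "=r"(n) : "r"(x) : "cc");
    return n;
#else
    /* Fallback when POPCNT is not enabled at compile time. */
    return (uint32_t)__builtin_popcount(x);
#endif
}

/* Population count of a 64-bit value using only 32-bit operations. */
static inline uint32_t popcnt64_split(uint64_t x)
{
    return popcnt32((uint32_t)x) + popcnt32((uint32_t)(x >> 32));
}
```

Note the straight quotes and the `uint32_t` operands: both differ from the snippet in the question.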

Jester
  • Well, it is awkward that I can process up to 128 bits of packed data on a 32-bit OS using SSEx, while the POPCNT instruction can't do the job on 64-bit data, even packed :-s – user3581220 Jan 23 '15 at 13:57
  • @user3581220 why should it work on 64-bit values in 32-bit mode when there's not even a 64-bit register there? – phuclv Jan 26 '15 at 15:04
  • Because there are 128 bits xmm registers that are accessible on 32-bits OS (considering values as packed wouldn't have been a big deal). – user3581220 Jan 26 '15 at 19:05
  • @user3581220 Intel decided to make the XMM registers accessible in all operating modes of the processor, but AMD decided to only allow the full 64 bits of the integer registers to be accessed in 64-bit long mode. – Ross Ridge Jan 27 '15 at 00:27
  • @user3581220 XMM registers are not for a big 128-bit number but for storing multiple values at the same time. They have different usage as XMM is not a general purpose register. Also the use of 64-bit registers require REX prefix which takes the opcode for inc and dec. That's why you can't use single-byte inc/dec in 64-bit mode. The reason is that there's almost no codepoint left for a prefix and opcode in x86 anymore – phuclv Jan 27 '15 at 04:07
  • @RossRidge possibly AMD had no choice. They had to choose between discarding some in-use instructions in 32-bit mode (which would break a lot of things) and not allowing access to 64-bit registers in that mode – phuclv Jan 27 '15 at 04:16
2

64-bit POPCNT is not supported on 32-bit systems because the REX prefix, which selects the 64-bit operand size, is only available in 64-bit mode (not under a 32-bit OS). Hence the error quoted in the question:

and writing POPCNTQ leads to “invalid instruction suffix for popcnt”.

see here: http://www.felixcloutier.com/x86/POPCNT.html (quote below)

Opcode             Instruction          Op/En  64-Bit Mode  Compat/Leg Mode  Description
F3 0F B8 /r        POPCNT r16, r/m16    RM     Valid        Valid            POPCNT on r/m16
F3 0F B8 /r        POPCNT r32, r/m32    RM     Valid        Valid            POPCNT on r/m32
F3 REX.W 0F B8 /r  POPCNT r64, r/m64    RM     Valid        N.E.             POPCNT on r/m64

A workaround would be to split the 64/128-bit operand across two/four 32-bit POPCNT instructions:

; a = uint64, 64-bit operand, little endian
popcnt eax, dword ptr [a]      ; bits set in the low dword
popcnt edx, dword ptr [a+4]    ; bits set in the high dword
add eax, edx                   ; total count (0..64)
xor edx, edx                   ; zero for the high dword of the result
mov dword ptr [b], eax         ; low dword of b = count
mov dword ptr [b+4], edx       ; high dword = 0; only necessary because the target is 64-bit (the count never exceeds 64)

EDIT: 64 bit operand size version of (binary) HammingDistance in MASM32 code:

Hamming_64 PROC word1:QWORD , word2: QWORD
  mov ecx, dword ptr [word1]
  mov edx, dword ptr [word1+4]
  xor ecx, dword ptr [word2]
  xor edx, dword ptr [word2+4]
  popcnt eax, ecx
  popcnt edx, edx   ; reuse EDX so that callee-saved EBX is not clobbered
  add eax, edx      ; returns distance in EAX
  ret
Hamming_64 ENDP
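For comparison, the same XOR-then-count idea can be sketched in C for longer vectors, such as the 64-byte vectors mentioned in the comments. This is an illustration under assumptions, not a tuned implementation: `hamming_bytes` is a hypothetical name, and `__builtin_popcount` is a GCC/Clang builtin that may or may not compile to the POPCNT instruction depending on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: Hamming distance between two byte vectors of
   equal length, processed 32 bits at a time. */
static unsigned hamming_bytes(const uint8_t *a, const uint8_t *b, size_t len)
{
    unsigned dist = 0;
    size_t i = 0;

    /* Main loop: XOR one 32-bit word from each vector, count the bits. */
    for (; i + 4 <= len; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, 4);  /* memcpy avoids alignment and aliasing issues */
        memcpy(&wb, b + i, 4);
        dist += (unsigned)__builtin_popcount(wa ^ wb);
    }

    /* Tail: any remaining 1 to 3 bytes. */
    for (; i < len; ++i)
        dist += (unsigned)__builtin_popcount((unsigned)(a[i] ^ b[i]));

    return dist;
}
```

Calling `hamming_bytes(v1, v2, 64)` covers one pair of 64-byte vectors with sixteen 32-bit XOR and popcount steps.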
zx485
  • I had previously seen this document on a mirror site but I wanted to be sure nobody found a way to use the 64bits version. – user3581220 Jan 25 '15 at 20:55
  • Well, my current assembly method actually does an XOR on two 64-byte vectors and then computes the Hamming distance, so I've tried to replace the shuffle with 32-bit POPCNT in two different ways: a) writing the XOR result back from xmm to an aligned array before running POPCNT, b) shifting the XOR result in the xmm register by 4 bytes and moving it into a 32-bit exx register before running POPCNT on that register. Even after arranging register use in the code to take instruction latency and throughput into account, I couldn't manage to improve on the shuffle method (actually I lost between 5 and 15%). – user3581220 Jan 25 '15 at 21:09
  • I'm still not sure what you're trying to achieve or what algorithm you're trying to implement. Anyway, I added a 32 bit version to the post, that calculates the Hamming distance of two 64 bit values(QWord) without using xmm registers. Pretty straight forward and should be very fast. – zx485 Jan 26 '15 at 14:28
  • I tried it too. The difference from the popcount using SSSE3 shuffle is only 1-2%. – user3581220 Jan 28 '15 at 23:04
  • However this time I have tested running on each core with OpenMP one thread for the 32-bit XOR and POPCOUNT using 32-bit GP registers and another thread for the 128-bit XOR and SSSE3 shuffle popcount using xmm registers. One core process 200.000 vectors in a for loop split for the threads first as [0, 100.000[ and [100.000, 200.000[ , then as for(i=0; i<200.000; i+=2) and for(i=1; i<200.000; i+=2). There was no gain compared to letting OpenMP run SSE4 POPCNT in two threads for a core. – user3581220 Jan 28 '15 at 23:23
1

I don't know if there is a 32 bit popcnt instruction, but I would bet that you can't use a 64 bit popcnt in 32 bit code. Try declaring a and b as uint32_t. BTW uint64_t is standard C, uint64 isn't.

gnasher729
  • Yes, there is. But I'll miss a great part (all?) of the improvement compared to my current SSSE3 assembly implementation using shuffle. – user3581220 Jan 23 '15 at 13:49
  • My mistake about uint64_t. Wrote it just to show how I was passing arguments to assembly function registers. – user3581220 Jan 23 '15 at 13:50
0

After implementing the 32-bit POPCNT in assembly, it looks like there is no real improvement compared to the SSSE3 shuffle assembly method. As I suspected, only the 64-bit POPCNT version could almost double the speed.

user3581220