Consider the following code:

#include <stdint.h>

typedef unsigned uint;

uint parity( uint64_t x )
    {
    uint32_t v = x ^ (x >> 32);
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    return (uint)(v ^ (v >> 1)) & 1;
    }

Is there a way of radically reorganising this code to get a serious improvement from instruction-level parallelism on, say, an Intel x86-64 machine?

GCC produced the following code:

parity(unsigned long):
    mov     rax, rdi
    shr     rax, 32
    xor     eax, edi
    mov     edi, eax
    shr     edi, 16
    xor     eax, edi
    mov     edi, eax
    shr     edi, 8
    xor     eax, edi
    mov     edi, eax
    shr     edi, 4
    xor     eax, edi
    mov     edi, eax
    shr     edi, 2
    xor     eax, edi
    mov     edx, eax 
    shr     eax
    xor     eax, edx 
    and     eax, 1
    ret
too honest for this site
Cecil Ward
  • What is `(uint)v`? – David Ranieri Feb 05 '17 at 07:20
  • If you have SSE4.2: `return _mm_popcnt_u64(x) & 1;` – Mysticial Feb 05 '17 at 07:21
  • I forgot about popcnt - many thanks. – Cecil Ward Feb 05 '17 at 07:24
  • Be wary of dual tagging with C and C++. They're different languages and often what is appropriate in one is not appropriate in the other. – Jonathan Leffler Feb 05 '17 at 07:26
  • but this _type_ of data-reducing algorithm is just inherently one big serial dependency and there's not much I can do about it? (apart from using the hardware as in popcnt) – Cecil Ward Feb 05 '17 at 07:27
  • If you're trying to get the parity of a long stream of integers, just XOR them together. Then you only need to do one reduction at the end. That's also vectorizable. – Mysticial Feb 05 '17 at 07:28
  • @Jonathan Leffler - agreed, but the algorithm is the same in D or in C++ or in C, and it's more about getting the best out of the machine's ILP if you have any – Cecil Ward Feb 05 '17 at 07:28
  • @Mysticial - vg point, worthy tip – Cecil Ward Feb 05 '17 at 07:30
  • That's why you got a "be wary" rather than a "don't", which is the advice I'd normally give. This time, you can get away with it. Often, you won't. Be cautious, or wary. – Jonathan Leffler Feb 05 '17 at 07:30
  • God people are fast on this forum, that's one reason why I love it so. – Cecil Ward Feb 05 '17 at 07:30
  • @Jonathan - agreed – Cecil Ward Feb 05 '17 at 07:31
  • Should we take the C++ off as being unhelpful? – Cecil Ward Feb 05 '17 at 07:32
  • Up to you. Like I said, this time it is OK, and you're unlikely to get downvotes because of the dual tagging. – Jonathan Leffler Feb 05 '17 at 07:34
  • Aside: An idea I had to begin with: the change of type to 32-bits is only to save some bytes on the x86-64. I don't expect it to improve the speed at all, unless the saving of a few bytes might help make the code fit into the instruction cache – Cecil Ward Feb 05 '17 at 07:36
  • The algorithm you've chosen is pretty serial. That said, I can't imagine you are calling this method _once_ or else you wouldn't care about the speed. Assuming you are calling it a lot, you can create a method that does the XOR-fold for several (disjoint, even) integers at once, and operates faster. I assume you've also looked at the [alternatives here](https://graphics.stanford.edu/~seander/bithacks.html#ParityNaive). The xor-folding one they show seems faster than yours by using a couple of tricks for the last couple lines. – BeeOnRope Feb 07 '17 at 03:32
  • Yes, I had looked at that page, but I wanted to keep it simple as I only wanted to ask whether things _have to_ be so serial. I really am not getting any of the wonderful n-way ILP that you can often get nowadays (quite often n=3, even). – Cecil Ward Feb 09 '17 at 20:53
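
The two suggestions from the comments above (use a population count, and XOR a whole stream together before doing a single reduction) can be sketched in C roughly as follows. This is only an illustration under assumptions: `__builtin_popcountll` is a GCC/Clang builtin rather than standard C, the function names `parity_popcount` and `parity_of_stream` are made up for this example, and the four independent accumulators are just one way of giving the CPU separate dependency chains to overlap. Note that `parity_of_stream` returns the parity of all the bits in the array taken together, not one parity per element.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity of one 64-bit word via population count.
   __builtin_popcountll is a GCC/Clang builtin (assumed available). */
static unsigned parity_popcount(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x) & 1u;
}

/* Parity of every bit in data[0..n-1] taken together.
   The words are XORed first and reduced only once at the end.
   Four accumulators form four independent dependency chains. */
static unsigned parity_of_stream(const uint64_t *data, size_t n)
{
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 ^= data[i];
        a1 ^= data[i + 1];
        a2 ^= data[i + 2];
        a3 ^= data[i + 3];
    }
    for (; i < n; i++)          /* leftover words */
        a0 ^= data[i];

    return parity_popcount(a0 ^ a1 ^ a2 ^ a3);
}
```

Whether the hardware `popcnt` instruction is actually emitted depends on the compiler's target options; without them the builtin may fall back to a slower software routine.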

1 Answer

In the 32-bit world I would write this directly in assembler, something like `test eax,eax` followed by `setpo al` (the SETcc instructions take an 8-bit operand).

UPDATE 2017-02-06: @EOF is right, the `test` instruction sets the parity flag according to the low byte of the result only.
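
Because the parity flag reflects only the low byte, the wider value would first have to be folded down to 8 bits. A minimal C sketch of that idea, not the answer's own code: the name `parity_via_low_byte` is hypothetical, and `__builtin_parity` is a GCC/Clang builtin standing in for the final flag test. Whether the compiler really lowers it to a parity-flag instruction is not guaranteed.

```c
#include <stdint.h>

/* Fold 64 bits down to 8 with XOR, then take the parity of that byte.
   After the folds, the low byte has the same parity as the full word,
   which is the part the x86 parity flag would report on. */
static unsigned parity_via_low_byte(uint64_t x)
{
    uint32_t v = (uint32_t)(x ^ (x >> 32));
    v ^= v >> 16;
    v ^= v >> 8;
    /* __builtin_parity returns 1 for an odd number of set bits
       (GCC/Clang builtin, assumed available). */
    return (unsigned)__builtin_parity(v & 0xFFu);
}
```

This removes only the last three shift/XOR steps of the original chain; the three folds that remain still depend on one another.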

user5329483