Calculating parity in parallel

Question

Consider the following code:

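#include <stdint.h>   /* uint32_t, uint64_t */

typedef unsigned uint;

/* Returns 1 if x has an odd number of set bits, otherwise 0. */
uint parity( uint64_t x )
    {
    uint32_t v = x ^ (x >> 32);   /* fold 64 bits into 32 */
    v ^= v >> 16;                 /* each step halves the width; */
    v ^= v >> 8;                  /* every step depends on the one before */
    v ^= v >> 4;
    v ^= v >> 2;
    return (uint)(v ^ (v >> 1)) & 1;
    }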

Is there a way of radically reorganising this code to get a serious improvement from instruction-level parallelism on, say, an Intel x86-64 machine?

GCC produced the following code:

parity(unsigned long):
    mov     rax, rdi
    shr     rax, 32
    xor     eax, edi
    mov     edi, eax
    shr     edi, 16
    xor     eax, edi
    mov     edi, eax
    shr     edi, 8
    xor     eax, edi
    mov     edi, eax
    shr     edi, 4
    xor     eax, edi
    mov     edi, eax
    shr     edi, 2
    xor     eax, edi
    mov     edx, eax
    shr     eax
    xor     eax, edx
    and     eax, 1
    ret

Be wary of dual tagging with C and C++. They're different languages and often what is appropriate in one is not appropriate in the other. — Jonathan Leffler, Feb 05 '17 at 07:26
So this _type_ of data-reducing algorithm is inherently one big serial dependency, and there's not much I can do about it (apart from using the hardware, as with popcnt)? — Cecil Ward, Feb 05 '17 at 07:27
If you're trying to get the parity of a long stream of integers, just XOR them together. Then you only need to do one reduction at the end. That's also vectorizable. — Mysticial, Feb 05 '17 at 07:28
@Jonathan Leffler - agreed, but the algorithm is the same in D or in C++ or in C, and it's more about getting the best out of the machine's ILP, if you have any. — Cecil Ward, Feb 05 '17 at 07:28
That's why you got a "be wary" rather than a "don't", which is the advice I'd normally give. This time, you can get away with it. Often, you won't. Be cautious — or wary. — Jonathan Leffler, Feb 05 '17 at 07:30
God people are fast on this forum, that's one reason why I love it so. — Cecil Ward, Feb 05 '17 at 07:30
Up to you. Like I said, this time it is OK, and you're unlikely to get downvotes because of the dual tagging. — Jonathan Leffler, Feb 05 '17 at 07:34
Aside: An idea I had to begin with: the change of type to 32-bits is only to save some bytes on the x86-64. I don't expect it to improve the speed at all, unless the saving of a few bytes might help make the code fit into the instruction cache — Cecil Ward, Feb 05 '17 at 07:36
The algorithm you've chosen is pretty serial. That said, I can't imagine you are calling this method _once_ or else you wouldn't care about the speed. Assuming you are calling it a lot, you can create a method that does the XOR-fold for several (disjoint, even) integers at once, and operates faster. I assume you've also looked at the [alternatives here](https://graphics.stanford.edu/~seander/bithacks.html#ParityNaive). The xor-folding one they show seems faster than yours by using a couple of tricks for the last couple lines. — BeeOnRope, Feb 07 '17 at 03:32
Yes, I had looked at that page, but I wanted to keep it simple, as I only wanted to ask whether things _have to_ be so serial. I really am not getting any of the wonderful n-way ILP that you can often get nowadays (quite often n=3, even). — Cecil Ward, Feb 09 '17 at 20:53
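
A minimal sketch of the "XOR the stream together, then reduce once" idea from the comments above. The names parity64 and parity_of_stream are hypothetical, and the choice of four accumulators is an assumption made purely for illustration: the XORs inside the loop are independent of one another, so an out-of-order core may be able to overlap them, while the serial fold runs only once at the end. Whether this is actually faster would need to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial XOR-fold of one 64-bit word, as in the question.
   Returns 1 for an odd number of set bits, otherwise 0. */
static unsigned parity64(uint64_t x)
{
    uint32_t v = (uint32_t)(x ^ (x >> 32));
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Parity of all bits in words[0..n-1] (hypothetical helper).
   XOR preserves parity, so the words are combined first and the
   fold is applied only once. The four accumulators form four
   independent dependency chains. */
unsigned parity_of_stream(const uint64_t *words, size_t n)
{
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 ^= words[i];
        a1 ^= words[i + 1];
        a2 ^= words[i + 2];
        a3 ^= words[i + 3];
    }
    for (; i < n; ++i)          /* leftover words */
        a0 ^= words[i];

    return parity64((a0 ^ a1) ^ (a2 ^ a3));
}
```

This does not shorten the dependency chain for a single word; it only amortises the fold over many words.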

user5329483 · Answer 1 · 2017-02-06T18:03:30.860

-3

In the 32-bit world I would write this directly in assembler, something like test eax,eax followed by setpo al (SETcc takes an 8-bit operand, so it cannot target EAX).

UPDATE 2017-02-06: @EOF is right, the test instruction sets the parity flag according to the low byte only.

edited Feb 06 '17 at 18:03

answered Feb 05 '17 at 07:35

That would be a hell of a good plan if you have no popcnt. Could be used on a 64-bit processor too. – Cecil Ward Feb 05 '17 at 07:39
Possibly a great deal faster than popcnt, too? – Cecil Ward Feb 05 '17 at 07:44
@CecilWard: As the first assembler instruction suggests: TEST it! :) – user5329483 Feb 05 '17 at 07:52
The parity-flag only considers the parity of the lowest eight bits in the result. – EOF Feb 06 '17 at 11:06
Thanks for that, EOF. I didn't know that, and I've never tested the parity flag when writing x86 assembler; otherwise I would have been bitten before now. – Cecil Ward Feb 09 '17 at 20:49
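
To illustrate the point about the low byte, here is a small software model in C. It is a sketch, not real flag access: pf_model is a hypothetical function that mimics the x86 rule that PF is set when the low 8 bits of a result contain an even number of 1 bits. The other two function names are also made up for illustration. The model suggests that folding the word down to 8 bits first would let the flag cover the whole value.

```c
#include <stdint.h>

/* Software model of the x86 parity flag (hypothetical helper):
   1 when the low 8 bits of result hold an even number of set bits.
   All higher bits are ignored, as on the hardware. */
static unsigned pf_model(uint64_t result)
{
    unsigned b = (unsigned)(result & 0xFFu);
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return (b & 1u) ^ 1u;
}

/* What a test followed by a set-on-parity-odd would yield:
   the odd-parity bit of the low byte only. */
unsigned parity_low_byte_only(uint64_t x)
{
    return pf_model(x) ^ 1u;
}

/* Fold 64 bits into the low byte first, so that the low byte
   carries the parity of the whole word, then consult the flag. */
unsigned parity_fold_then_flag(uint64_t x)
{
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    return pf_model(x) ^ 1u;
}
```

For example, 0x100 has a single set bit, but it lies outside the low byte, so parity_low_byte_only reports even parity while parity_fold_then_flag reports odd.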
