C Population Count of unsigned 64-bit integer with a maximum value of 15

Question

I use a population count (Hamming weight) function intensively in a Windows C application and need to optimize it as much as possible. In more than half of the calls I only need the result up to a maximum of 15. The software will run on a wide range of processors, both old and new. I already use the POPCNT instruction when Intel's SSE4.2 or AMD's SSE4a is present, but would like to optimize the software implementation (used as a fallback when no SSE4 is present) as much as possible.

Currently I have the following software implementation of the function for 64-bit (platform) mode:

int population_count64(unsigned __int64 w) {
    /* 2-bit, then 4-bit, then 8-bit partial counts */
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* the multiply sums all eight bytes into the top byte */
    return (int)((w * 0x0101010101010101ULL) >> 56);
}

So to summarize:

(1) I would like to know if it is possible to optimize this for the case when I only want to know the value to a maximum of 15.

(2) Is there a faster software implementation (for both Intel and AMD CPUs) than the function above (for unsigned 64-bit integers)?

I believe `return int(w * 0x0101010101010101ULL) >> 56` will prematurely truncate the result of the multiplication to `int`, which may be only 32 bits wide. — j_random_hacker, Jun 02 '10 at 18:39
Other possible very minor optimisations include: (a) skipping the last step or two on some iterations if you always perform this on more than one 64-bit value at a time; (b) see if you can rearrange slightly to use the same constants more often -- these might then be able to go in registers, which *might* be faster (less instruction decoding time) than always using immediate values on some CPUs (benchmark and see). — j_random_hacker, Jun 02 '10 at 18:46
really? care to explain the truncate part? Remember this is in 64bit mode. — BitTwiddler1011, Jun 02 '10 at 20:11
You're casting the 64-bit result of the multiplication to `int`, which is 32-bit. This function should return zero, regardless of the input. I think you placed the closing paren on the last line wrong. — slacker, Jun 02 '10 at 20:26
@slacker: Actually it invoked UB due to having a shift greater than the width of the type... — R.. GitHub STOP HELPING ICE, Sep 02 '11 at 15:09

Jaap Weidemann · Accepted Answer · 2010-06-02T20:31:56.523

It is indeed possible to optimise your function for the "maximum 15" case. The following shaves off a few operations:


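/* Only valid when w has at most 15 bits set. */
static inline int population_count64_max15(unsigned __int64 w)
{
  w -= (w >> 1) & 0x5555555555555555ULL;
  w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);

  /* the multiply sums all sixteen 4-bit counts into the top nibble */
  return (int)((w * 0x1111111111111111ULL) >> 60);
}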

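Inlining the function (using the inline keyword as above) may also improve performance, since the body is only a handful of operations. Keep in mind that inline is a hint and the compiler is free to ignore it.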
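To see where the "maximum 15" restriction comes from, it can help to put the full and the shortened versions side by side. The following is only a sketch: it assumes a C99 compiler and uses uint64_t from <stdint.h> in place of unsigned __int64, and the names popcount64_full and popcount64_max15 are made up for illustration.

```c
#include <stdint.h>

/* Reference version: valid for any 64-bit input. */
static int popcount64_full(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    w  = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int)((w * 0x0101010101010101ULL) >> 56);
}

/* Shortened version: skips the byte-folding step and sums the sixteen
 * 4-bit counts directly. The sum has to fit in 4 bits, so the result
 * is only meaningful when at most 15 bits are set. */
static int popcount64_max15(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    return (int)((w * 0x1111111111111111ULL) >> 60);
}
```

For inputs with 15 or fewer bits set the two functions agree. Once 16 or more bits are set, the 4-bit sums overflow and the shortened version no longer returns the count (0xffff, for instance, does not come back as 16), so it does not behave like a saturating count.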

slacker · Answer 2 · 2010-06-02T20:09:41.743

If you're on a 32-bit machine, split w into two 32-bit words, calculate the popcount separately for each half, then add the results. This gets rid of the extra operations needed to synthesize 64-bit operations from 32-bit ones (shifts, multiplies...). It also allows for increased parallelism if you interleave the two calculations.
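A rough sketch of that split, with the two halves interleaved, might look like the following. It assumes uint32_t and uint64_t from <stdint.h> and a 32-bit unsigned int; popcount64_split is a hypothetical name. As a small variation on "count each half, then add", the two halves are added before the final multiply, which saves one multiplication.

```c
#include <stdint.h>

/* Popcount of a 64-bit value using only 32-bit arithmetic. */
static int popcount64_split(uint64_t w)
{
    uint32_t lo = (uint32_t)w;          /* low 32 bits  */
    uint32_t hi = (uint32_t)(w >> 32);  /* high 32 bits */

    /* The two chains are independent, so the CPU can overlap them. */
    lo -= (lo >> 1) & 0x55555555u;
    hi -= (hi >> 1) & 0x55555555u;
    lo  = (lo & 0x33333333u) + ((lo >> 2) & 0x33333333u);
    hi  = (hi & 0x33333333u) + ((hi >> 2) & 0x33333333u);
    lo  = (lo + (lo >> 4)) & 0x0f0f0f0fu;
    hi  = (hi + (hi >> 4)) & 0x0f0f0f0fu;

    /* Each byte now holds at most 8, so lo + hi holds at most 16 per
     * byte and the four bytes sum to at most 64, which fits in the
     * top byte after the multiply. */
    return (int)(((lo + hi) * 0x01010101u) >> 24);
}
```

Whether this actually beats the compiler's own lowering of the 64-bit version on a given 32-bit target is something to measure rather than assume.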

If you're compiling 64-bit code, you may try this:

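#include <stdint.h>

int popcnt64(uint64_t w)
{
   /* Two independent chains; each 4-bit field holds twice its bit count. */
   uint64_t w1 = (w & 0x2222222222222222ULL) + ((w + w) & 0x2222222222222222ULL);
   uint64_t w2 = ((w >> 1) & 0x2222222222222222ULL) + ((w >> 2) & 0x2222222222222222ULL);
   w1 = (w1 + (w1 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
   w2 = (w2 + (w2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
   /* The top byte ends up with twice the count, hence the shift by 57, not 56. */
   return (int)(((w1 + w2) * 0x0101010101010101ULL) >> 57);
}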

This uses more operations, but the two chains are independent of each other, which gives the CPU more opportunities for parallel execution. On newer CPUs it should be slightly faster; on others it will be slightly slower.
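Since the outcome depends on the CPU, the practical answer is to time both variants on the machines you care about. Below is a minimal harness sketch, not a rigorous benchmark: popcnt_single, popcnt_dual, next_rand and run are hypothetical names, the inputs come from a simple xorshift generator, and calling through a function pointer prevents inlining, which will distort the absolute numbers.

```c
#include <stdint.h>
#include <time.h>

typedef int (*popcnt_fn)(uint64_t);

/* Single dependency chain (the version from the question). */
static int popcnt_single(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    w  = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int)((w * 0x0101010101010101ULL) >> 56);
}

/* Two independent chains (the version from this answer). */
static int popcnt_dual(uint64_t w)
{
    uint64_t w1 = (w & 0x2222222222222222ULL) + ((w + w) & 0x2222222222222222ULL);
    uint64_t w2 = ((w >> 1) & 0x2222222222222222ULL) + ((w >> 2) & 0x2222222222222222ULL);
    w1 = (w1 + (w1 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    w2 = (w2 + (w2 >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int)(((w1 + w2) * 0x0101010101010101ULL) >> 57);
}

/* xorshift64 step: cheap pseudo-random inputs; the state must be nonzero. */
static uint64_t next_rand(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Runs f over iters pseudo-random inputs from a fixed seed. Returns the
 * sum of all counts (usable as a checksum) and stores the elapsed CPU
 * time in *seconds. */
static uint64_t run(popcnt_fn f, long iters, double *seconds)
{
    uint64_t state = 0x9e3779b97f4a7c15ULL;
    uint64_t total = 0;
    clock_t start = clock();
    for (long i = 0; i < iters; ++i)
        total += (uint64_t)f(next_rand(&state));
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return total;
}
```

Calling run(popcnt_single, n, &t1) and run(popcnt_dual, n, &t2) with the same n should give identical checksums, which doubles as a correctness check; t1 and t2 can then be compared. For results that reflect real use, time the functions inside your actual inner loop with optimizations enabled.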

Will this be faster or slower than the accepted answer on a 64 bit processor? What about a 32 bit processor? — jjxtra, Mar 21 '18 at 19:04
