4

I use a population count (hamming weight) function intensively in a windows c application and have to optimize it as much as possible in order to boost performance. More than half the cases where I use the function I only need to know the value to a maximum of 15. The software will run on a wide range of processors, both old and new. I already make use of the POPCNT instruction when Intel's SSE4.2 or AMD's SSE4a is present, but would like to optimize the software implementation (used as a fall back if no SSE4 is present) as much as possible.

Currently I have the following software implementation of the function for 64bit (platform) mode:

int population_count64(unsigned __int64 w) {
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return int((w * 0x0101010101010101ULL) >> 56);
}

So to summarize:

(1) I would like to know if it is possible to optimize this for the case when I only want to know the value to a maximum of 15.

(2) Is there a faster software implementation (for both Intel and AMD CPU's) than the function above (for unsigned 64bit integers)?

  • 1
    I believe `return int(w * 0x0101010101010101ULL) >> 56` will prematurely truncate the result of the multiplication to `int`, which may be only 32 bits wide. – j_random_hacker Jun 02 '10 at 18:39
  • Other possible very minor optimisations include: (a) skipping the last step or two on some iterations if you always perform this on more than one 64-bit value at a time; (b) see if you can rearrange slightly to use the same constants more often -- these might then be able to go in registers, which *might* be faster (less instruction decoding time) than always using immediate values on some CPUs (benchmark and see). – j_random_hacker Jun 02 '10 at 18:46
  • really? care to explain the truncate part? Remember this is in 64bit mode. – BitTwiddler1011 Jun 02 '10 at 20:11
  • 1
    You're casting the 64-bit result of the multiplication to `int`, which is 32-bit. This function should return zero, regardless of the input. I think you placed the closing paren on the last line wrong. – slacker Jun 02 '10 at 20:26
  • 1
    @slacker: Actually it invoked UB due to having a shift greater than the width of the type... – R.. GitHub STOP HELPING ICE Sep 02 '11 at 15:09

2 Answers2

5

It is indeed possible to optimise your function for the "maximum 15" case. The following shaves off a few operations:


inline int population_count64_max15(unsigned __int64 w)
{
  w -= (w >> 1) & 0x5555555555555555ULL;
  w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);

  return int((w * 0x1111111111111111ULL) >> 60);
}


Inlining the function (using the inline keyword as above) should also increase performance.

2

If you're on a 32-bit machine, split w into two 32-bit words, calculate the popcount separately for each half, then add up. This will get rid of some unneeded operations that are required to synthesize 64-bit operations from 32-bit ones (shifts, mults...). This also allows for increased parallelism if you interleave the calculations.

If you're compiling 64-bit code, you may try this:

int popcnt64(uint64_t w)
{
   uint64_t w1 = (w & 0x2222222222222222) + ((w+w) & 0x2222222222222222);
   uint64_t w2 = (w >> 1 & 0x2222222222222222) + (w >> 2 & 0x2222222222222222);
   w1 = w1 + (w1 >> 4) & 0x0f0f0f0f0f0f0f0f;
   w2 = w2 + (w2 >> 4) & 0x0f0f0f0f0f0f0f0f;
   return (w1 + w2) * 0x0101010101010101 >> 57;
}

This contains more operations, but gives more opportunities of parallel execution to the CPU. On newer CPUs it should be slightly faster, on others it will be slightly slower.

slacker
  • 2,142
  • 11
  • 10
  • Will this be faster or slower than the accepted answer on a 64 bit processor? What about a 32 bit processor? – jjxtra Mar 21 '18 at 19:04