Tweaking MIT's bitcount algorithm to count words in parallel?

Question

I want to use a version of the well known MIT bitcount algorithm to count neighbors in Conway's game of life using SSE2 instructions.

Here's the MIT bitcount in c, extended to count bitcounts > 63 bits.

int bitCount(unsigned long long n)
{
unsigned long long uCount;

uCount = n – ((n >> 1) & 0×7777777777777777)
           - ((n >> 2) & 0×3333333333333333)
           - ((n >> 3) & 0×1111111111111111);
return ((uCount + (uCount >> 4))
& 0x0F0F0F0F0F0F0F0F) % 255;
}

Here's a version in Pascal

function bitcount(n: uint64): cardinal;
var ucount: uint64;
begin
  ucount:= n - ((n shr 1) and $7777777777777777)
             - ((n shr 2) and $3333333333333333) 
             - ((n shr 3) and $1111111111111111);
  Result:= ((ucount + (count shr 4)) 
           and $0F0F0F0F0F0F0F0F) mod 255;
end;

I'm looking to count the bits in this structure in parallel.

  32-bit word where the pixels are laid out as follows.
  lo-byte         lo-byte neighbor
  0 4 8 C  048C   0 4 8 C 
   +---------------+
  1|5 9 D  159D   1|5 9 D 
   |               |
  2|6 A E  26AE   2|6 A E  
   +---------------+
  3 7 B F  37BF   3 7 B F 
 |-------------|            << slice A
   |---------------|        << slice B
     |---------------|      << slice C

Notice how this structure has 16 bits in the middle that need to be looked up. I want to calculate neighbor counts for each of the 16 bits in the middle using SSE2. In order to do this I put slice A in XMM0 low-dword, slice B in XXM0-dword1 etc.
I copy XMM0 to XMM1 and I mask off bits 012-456-89A for bit 5 in the low word of XMM0, do the same for word1 of XMM0, etc. using different slices and masks to make sure each word in XMM0 and XMM1 holds the neighbors for a different pixel.

Question
How do I tweak the MIT-bitcount to end up with a bitcount per word/pixel in each XMM word?

Remarks
I don't want to use a lookup table, because I already have that approach and I want to test to see if SSE2 will speed up the process by not requiring memory accesses to the lookup table.

An answer using SSE assembly would be optimal, because I'm programming this in Delphi and I'm thus using x86+SSE2 assembly code.

Just to clarify - you want the popcnt for each of 8 x 16 bit words in a 128 bit vector ? E.g. input = { 0, 1, 2, 3, 4, 5, 6, 7 }, output = { 0, 1, 1, 2, 1, 2, 2, 3 }. Is that correct ? — Paul R, Jun 21 '11 at 22:15
I'm curious about your results - is the SSE2 method faster? And how does it compare to an implementation using the POPCNT assembly instruction (introduced with SSE4, which is available in Delphi btw)? Also this may be interesting: http://wm.ite.pl/articles/sse-popcount.html — PhiS, Jun 27 '11 at 20:08
@phiS, haven't fully tested it yet, lookup looks to be faster for now, that's because the cache is always hot because of a small test-set. If cache-misses start factoring in SSE should win. Problem is that the lookup generates the output in one go and SSE needs some translation to create the output. SSE3 have a `pshufb` instruction that can do a 4bit+overflow = 4.5bit lookup. This could be a winner. — Johan, Jun 29 '11 at 13:51
For Conway's "Game of Life", you don't want a bit counting algorithm. You want to implement the whole algorithm using pure binary logic, then implement it for 64 cells in parallel using unsigned 64 bit integers (or 128 or 256 bits in parallel using vector registers). — gnasher729, Apr 29 '14 at 15:57

score 3 · Accepted Answer · answered Jun 22 '11 at 07:57

The MIT algorithm would be tough to implement in SSE2, since there is no integer modulus instruction which could be used for the final ... % 255 expression. Of the various popcnt methods out there, the one that lends itself to SSE most easily and efficiently is probably the first one in Chapter 5 of "Hackers Delight" by Henry S. Warren, which I have implemented here in C using SSE intrinsics:

#include <stdio.h>
#include <emmintrin.h>

__m128i _mm_popcnt_epi16(__m128i v)
{
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)), _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)), _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)), _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
    v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_and_si128(_mm_srli_epi16(v, 8), _mm_set1_epi16(0x00ff)));
    return v;
}

int main(void)
{
    __m128i v0 = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v1;

    v1 = _mm_popcnt_epi16(v0);

    printf("v0 = %vhd\n", v0);
    printf("v1 = %vhd\n", v1);

    return 0;
}

Compile and test as follows:

$ gcc -Wall -msse2 _mm_popcnt_epi16.c -o _mm_popcnt_epi16
$ ./_mm_popcnt_epi16 
v0 = 0 1 2 3 4 5 6 7
v1 = 0 1 1 2 1 2 2 3
$

It looks like around 16 arithmetic/logical instructions so it should run at around 16 / 8 = 2 clocks per point.

You can easily convert this to raw assembler if you need to - each intrinsic maps to a single instruction.

Thanks Paul, this looks like a tight approach. I was wondering about the `mod` as well. — Johan, Jun 22 '11 at 08:23
There's also a good comparison table for a few algorithms at http://www.strchr.com/crc32_popcnt (both posted here included as well). — FrankH., Aug 30 '11 at 17:44

Tweaking MIT's bitcount algorithm to count words in parallel?

1 Answers1

Linked