I want to use a version of the well-known MIT bitcount algorithm to count neighbors in Conway's Game of Life using SSE2 instructions.
Here's the MIT bitcount in C, extended to handle bit counts greater than 63:
int bitCount(unsigned long long n)
{
    unsigned long long uCount;
    /* per-nibble bit counts */
    uCount = n - ((n >> 1) & 0x7777777777777777ULL)
               - ((n >> 2) & 0x3333333333333333ULL)
               - ((n >> 3) & 0x1111111111111111ULL);
    /* per-byte counts, then sum the bytes via mod 255 */
    return (int)(((uCount + (uCount >> 4))
                  & 0x0F0F0F0F0F0F0F0FULL) % 255);
}
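The final `% 255` is only a compact way of adding up the eight per-byte counts (256 is congruent to 1 modulo 255, and the total can never exceed 64). Since SSE2 has no modulo instruction, it may help to see the same routine with that step replaced by shifts and adds. This is a minimal sketch, not part of the original MIT routine; `bitCountFold` and `bitCountNaive` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same nibble and byte steps as bitCount, but the trailing % 255 is
   replaced by shift-and-add folding (an assumption of this sketch). */
static int bitCountFold(uint64_t n)
{
    uint64_t u = n - ((n >> 1) & 0x7777777777777777ULL)
                   - ((n >> 2) & 0x3333333333333333ULL)
                   - ((n >> 3) & 0x1111111111111111ULL);
    u = (u + (u >> 4)) & 0x0F0F0F0F0F0F0F0FULL; /* 8 byte counts, each 0..8 */
    u += u >> 8;   /* byte 0 now holds the sum of 2 byte counts */
    u += u >> 16;  /* ... of 4 byte counts */
    u += u >> 32;  /* ... of all 8 byte counts */
    u &= 0xFF;     /* discard the partial sums in the upper bytes */
    assert(u <= 64); /* a 64-bit word has at most 64 set bits */
    return (int)u;
}

/* Bit-by-bit reference count, used only to cross-check the sketch. */
static int bitCountNaive(uint64_t n)
{
    int count = 0;
    while (n != 0) {
        count += (int)(n & 1);
        n >>= 1;
    }
    return count;
}
```

No byte can overflow during the folding, because the largest value any byte ever holds is 64.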
Here's a version in Pascal:
function bitcount(n: uint64): cardinal;
var
  ucount: uint64;
begin
  ucount := n - ((n shr 1) and $7777777777777777)
              - ((n shr 2) and $3333333333333333)
              - ((n shr 3) and $1111111111111111);
  Result := ((ucount + (ucount shr 4))
             and $0F0F0F0F0F0F0F0F) mod 255;
end;
I want to count the bits in the following structure in parallel.
It is a 32-bit word in which the pixels are laid out as follows:
lo-byte lo-byte neighbor
0 4 8 C 048C 0 4 8 C
+---------------+
1|5 9 D 159D 1|5 9 D
| |
2|6 A E 26AE 2|6 A E
+---------------+
3 7 B F 37BF 3 7 B F
|-------------| << slice A
|---------------| << slice B
|---------------| << slice C
Notice how this structure has 16 bits in the middle that need to be looked up.
I want to calculate neighbor counts for each of the 16 bits in the middle using SSE2.
To do this, I put slice A in the low dword of XMM0, slice B in dword 1 of XMM0, etc.
I then copy XMM0 to XMM1 and mask off bits 012-456-89A for bit 5 in the low word of XMM0, do the same for word 1 of XMM0, and so on, using different slices and masks so that each word in XMM0 and XMM1 holds the neighbors of a different pixel.
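The masking step could be sketched with SSE2 intrinsics in C roughly as follows. This is an illustration under assumptions, not the actual code: `mask_from_bits` and `apply_word_masks` are hypothetical helpers, and how the slices are shifted into each word is left out. With the bit numbering from the diagram, positions 0, 1, 2, 4, 5, 6, 8, 9, A correspond to the mask `0x0777`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Build a 16-bit mask with the listed bit positions set
   (hypothetical helper, for illustration only). */
static uint16_t mask_from_bits(const int *bits, int count)
{
    uint16_t mask = 0;
    for (int i = 0; i < count; i++) {
        assert(bits[i] >= 0 && bits[i] < 16);
        mask = (uint16_t)(mask | (1u << bits[i]));
    }
    return mask;
}

/* AND every 16-bit word of v with its own mask: masks[0] applies to
   word 0, masks[1] to word 1, and so on (a single PAND in assembly). */
static __m128i apply_word_masks(__m128i v, const uint16_t masks[8])
{
    __m128i m = _mm_loadu_si128((const __m128i *)masks);
    return _mm_and_si128(v, m);
}
```

In assembly the eight masks would presumably sit in a 16-byte constant, so the whole step is one load plus one `PAND`.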
Question
How do I tweak the MIT-bitcount to end up with a bitcount per word/pixel in each XMM word?
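To make the goal concrete, here is a minimal sketch of what a per-word variant might look like, written with SSE2 intrinsics in C rather than assembly. It keeps the nibble and byte steps of the MIT bitcount but swaps the final `% 255` for one extra shift and add, since SSE2 has no modulo. The names `bitcount_epi16` and `lane16` are hypothetical, and I have not measured whether this beats a lookup table.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Replace each of the eight 16-bit words of v with its own bit count. */
static __m128i bitcount_epi16(__m128i v)
{
    const __m128i m7 = _mm_set1_epi16(0x7777);
    const __m128i m3 = _mm_set1_epi16(0x3333);
    const __m128i m1 = _mm_set1_epi16(0x1111);
    const __m128i mf = _mm_set1_epi16(0x0F0F);
    const __m128i m5 = _mm_set1_epi16(0x001F);

    /* Per-nibble counts: n - ((n>>1)&7..) - ((n>>2)&3..) - ((n>>3)&1..).
       The 16-bit shifts never move bits across word boundaries. */
    __m128i c = _mm_sub_epi16(v, _mm_and_si128(_mm_srli_epi16(v, 1), m7));
    c = _mm_sub_epi16(c, _mm_and_si128(_mm_srli_epi16(v, 2), m3));
    c = _mm_sub_epi16(c, _mm_and_si128(_mm_srli_epi16(v, 3), m1));

    /* Per-byte counts, each 0..8. */
    c = _mm_and_si128(_mm_add_epi16(c, _mm_srli_epi16(c, 4)), mf);

    /* Add the high byte to the low byte instead of taking % 255. */
    c = _mm_add_epi16(c, _mm_srli_epi16(c, 8));
    return _mm_and_si128(c, m5);  /* counts 0..16 fit in 5 bits */
}

/* Read back one 16-bit word (0..7) for inspection; hypothetical helper. */
static unsigned lane16(__m128i v, int i)
{
    uint16_t out[8];
    assert(i >= 0 && i < 8);
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

The intrinsics map one-to-one onto instructions: `_mm_srli_epi16` is `PSRLW`, `_mm_and_si128` is `PAND`, `_mm_sub_epi16` is `PSUBW`, and `_mm_add_epi16` is `PADDW`, so the same sequence should carry over to Delphi's inline assembler.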
Remarks
I don't want to use a lookup table, because I already have that approach and I want to test whether SSE2 speeds up the process by avoiding memory accesses to the lookup table.
An answer using SSE assembly would be ideal: I'm programming this in Delphi, so I'm using x86 + SSE2 assembly code.