I want to write a program for a GPU (preferably in OpenCL), and a large part of the computation consists of counting the number of 1s in a bit array (packed as long or int).
On a modern CPU I would simply use the native __popcnt instruction. I have read in several places that modern GPUs also have this instruction in hardware (at least for 32-bit operands; I am not sure about 64-bit), which would be a huge speedup for me.
However, I cannot find anywhere how to use this instruction. So:
1) How can I find out which GPUs have this instruction? (I still need to buy my GPU, so it will be a modern high-end one, probably a Radeon HD 7000 series or an NVIDIA Kepler.)
2) How do I call this instruction from OpenCL (or a similar GPU language)?
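
For reference, here is a minimal sketch of the computation described above, written as plain C for the CPU side. It assumes the bit array is packed into 64-bit unsigned words. The names popcount64 and count_ones are hypothetical, and the portable bit-twiddling (SWAR) routine only stands in for the native instruction; it is not the hardware popcount itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable SWAR population count for one 64-bit word.
   Stand-in for a native popcount instruction (assumption: no
   compiler intrinsic is used, so this builds anywhere). */
static unsigned popcount64(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit fields into bytes. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* Add all bytes together; the total ends up in the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Total number of set bits in an array of n packed 64-bit words. */
static uint64_t count_ones(const uint64_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += popcount64(words[i]);
    return total;
}
```

On the GPU, the inner popcount64 call is the part I would hope to replace with a single hardware instruction, with each work-item handling a slice of the array.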