
I am trying to speed up my method using SSE (on Visual Studio). I am a novice in this area. The main data type I work with in my method is the bitset of size 32, and the logical operation I mainly use is AND (with _BitScanForward used only occasionally). I was wondering whether SSE instructions can be used to speed up my procedures.

This is how I am doing it right now (I am not completely done yet, so I cannot compare results directly):

I load the operands (bitsets) using _mm_set_ps, calling to_ulong() on each bitset to convert it to an unsigned long integer:

__m128 v1 = _mm_set_ps(b1.to_ulong(),b2.to_ulong(),b3.to_ulong(),b4.to_ulong());
__m128 v2 = _mm_set1_ps(b.to_ulong());

This is followed by the actual AND operation:

__m128 v3 = _mm_and_ps(v1,v2);

At this point, I have two questions:

  1. Is converting the bitsets to unsigned long integers with to_ulong() a good way to do this? I suspect the conversion overhead is large and may cancel out any performance improvement I could get from using SSE.

  2. What is the best way to store v3 back to memory as 4 bitsets? I am planning to use the _mm_storeu_ps intrinsic.


1 Answer


A couple of things:

  • if your bit sets are basically 32 bit ints then you should be using a suitable integer SIMD type, i.e. __m128i, not floating point (__m128)

  • the _mm_set_XXX intrinsics are relatively expensive - unlike most SSE intrinsics they can generate quite a few instructions - if all you are doing is one AND operation then any performance benefit from the _mm_and_XXX operation will be more than wiped out by the cost of the _mm_set_XXX calls

Ideally if you just want to AND a bunch of bit sets in arrays then the code should look something like this:

#include <cstdint>
#include <emmintrin.h> // SSE2 integer intrinsics

const int N = 1024;

int32_t b1[N]; // 2 x arrays of input bit sets
int32_t b2[N];
int32_t b3[N]; // 1 x array of output bit sets

for (int i = 0; i < N; i += 4)
{
    __m128i v1 = _mm_loadu_si128((const __m128i *)&b1[i]); // load input bit sets
    __m128i v2 = _mm_loadu_si128((const __m128i *)&b2[i]);
    __m128i v3 = _mm_and_si128(v1, v2);                     // do the bitwise AND
    _mm_storeu_si128((__m128i *)&b3[i], v3);                // store the result
}

If you just want to AND an array in-place with a fixed mask then it would simplify to this:

const int N = 1024;

int32_t b1[N]; // input/output array of bit sets

const __m128i v2 = _mm_set1_epi32(0x12345678); // mask

for (int i = 0; i < N; i += 4)
{
    __m128i v1 = _mm_loadu_si128((const __m128i *)&b1[i]); // load input bit sets
    __m128i v3 = _mm_and_si128(v1, v2);                     // do the bitwise AND
    _mm_storeu_si128((__m128i *)&b1[i], v3);                // store the result
}

Note: for better performance make sure your input/output arrays are 16-byte aligned and then use _mm_load_si128/_mm_store_si128 rather than the unaligned counterparts used above.
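
A minimal sketch of what that aligned variant might look like, assuming C++11 for alignas (the function name is just a placeholder):

#include <cstdint>
#include <emmintrin.h> // SSE2

const int N = 1024;

alignas(16) int32_t b1[N]; // 16-byte aligned input/output array of bit sets

void and_in_place_aligned() // placeholder name
{
    const __m128i v2 = _mm_set1_epi32(0x12345678); // mask

    for (int i = 0; i < N; i += 4)
    {
        __m128i v1 = _mm_load_si128((const __m128i *)&b1[i]); // aligned load
        __m128i v3 = _mm_and_si128(v1, v2);                   // do the bitwise AND
        _mm_store_si128((__m128i *)&b1[i], v3);               // aligned store
    }
}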

  • Paul, _mm_set1_epi32 doesn't work with bitset<32> instances. Is there an alternative that works with actual bitset instances? – SMir May 31 '12 at 05:10
  • I'm not a C++ expert but I expect that you can convert a bitset<32> to a 32 bit int quite easily, either using an existing method or by writing a helper function (see the sketch after these comments). – Paul R May 31 '12 at 05:28
  • 1
    You might not be a C++ expert, but you've helped me with my SSE questions a lot! I just didn't want to use casting/conversion to avoid overhead. Have a good time sir! – SMir May 31 '12 at 07:05
  • If you're only setting a single mask value with `_mm_set1_epi32` *outside* the main loop then efficiency isn't too important - if you need to do this *inside* the loop though then you may need to look more carefully at how you implement the conversion between bitset<32> and int. I mostly use C for high performance SIMD code to avoid this kind of problem. – Paul R May 31 '12 at 09:51
  • How does using C give an advantage over C++? – SMir Jun 02 '12 at 01:07
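
Following up on the last few comments: a minimal sketch of the kind of bitset<32>/int helpers Paul R mentions, assuming SSE2 and C++11; the helper names are placeholders:

#include <bitset>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Broadcast a bitset<32> mask into all four 32-bit lanes.
// to_ulong() cannot throw here because the bitset has exactly 32 bits.
static inline __m128i mask_from_bitset(const std::bitset<32> &b)
{
    return _mm_set1_epi32((int)b.to_ulong());
}

// Write the four 32-bit lanes of v back out as four bitset<32> values.
static inline void store_as_bitsets(__m128i v, std::bitset<32> out[4])
{
    alignas(16) uint32_t tmp[4];
    _mm_store_si128((__m128i *)tmp, v);   // store the lanes to memory
    for (int i = 0; i < 4; ++i)
        out[i] = std::bitset<32>(tmp[i]); // rebuild the bitsets
}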