I am trying to speed up my method using SSE (in Visual Studio). I am new to this area. The main data types in my method are 32-bit bitsets, and the logical operation I use most is AND (_BitScanForward is used only occasionally). I was wondering whether SSE instructions could speed up these operations.
This is how I am doing it right now (I am not completely done yet, so I cannot compare results directly):
I load the operands (the bitsets) with _mm_set_ps, using to_ulong() to convert each bitset to an unsigned long integer:
__m128 v1 = _mm_set_ps(b1.to_ulong(),b2.to_ulong(),b3.to_ulong(),b4.to_ulong());
__m128 v2 = _mm_set1_ps(b.to_ulong());
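One thing worth checking in the load above: _mm_set_ps and _mm_set1_ps take float arguments, so each to_ulong() result is converted numerically to a float rather than having its bits copied. A bitset holding the value 5 becomes 5.0f, whose bit pattern is 0x40A00000, and values above 2^24 may not even survive the conversion exactly. The sketch below, assuming SSE2 is available (it is on any x64 target), shows that effect and an integer-domain alternative using _mm_set_epi32. The helper names low_lane_bits and load_bitsets are hypothetical, introduced only for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 integer intrinsics (also includes the SSE float ones)

// Hypothetical helper: returns the raw 32-bit pattern stored in lane 0 of a float vector.
inline std::uint32_t low_lane_bits(__m128 v) {
    float f = _mm_cvtss_f32(v);           // extract lane 0 as a float
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the bytes, no numeric conversion
    return bits;
}

// Hypothetical helper: packs four bitsets into one integer register with their bits unchanged.
// Like _mm_set_ps, _mm_set_epi32 lists the highest lane first, so b4 lands in lane 0.
// The casts to int wrap modulo 2^32 on mainstream compilers (guaranteed since C++20).
inline __m128i load_bitsets(const std::bitset<32>& b1, const std::bitset<32>& b2,
                            const std::bitset<32>& b3, const std::bitset<32>& b4) {
    return _mm_set_epi32(static_cast<int>(b1.to_ulong()), static_cast<int>(b2.to_ulong()),
                         static_cast<int>(b3.to_ulong()), static_cast<int>(b4.to_ulong()));
}
```

With the float version, low_lane_bits(_mm_set1_ps(std::bitset<32>(5).to_ulong())) yields 0x40A00000 rather than 5, so ANDing those lanes does not AND the original bitsets.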
This is followed by the actual AND operation:
__m128 v3 = _mm_and_ps(v1,v2);
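_mm_and_ps itself is a plain bitwise AND, so the operation is not the problem; the issue is what the lanes contain after the float load. If the data is loaded as integers, the matching intrinsic is _mm_and_si128, with _mm_set1_epi32 doing the broadcast that _mm_set1_ps does above. The sketch below assumes SSE2; the function name and_all_lanes is hypothetical. If you need to mix with float-typed code, _mm_castsi128_ps and _mm_castps_si128 reinterpret a register between the two types without converting the bits.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_set1_epi32, _mm_and_si128

// Hypothetical helper: ANDs each of the four 32-bit lanes of v with the same mask.
// This is the integer-domain counterpart of _mm_set1_ps(...) followed by _mm_and_ps(...).
inline __m128i and_all_lanes(__m128i v, std::uint32_t mask) {
    const __m128i m = _mm_set1_epi32(static_cast<int>(mask));  // broadcast mask to all 4 lanes
    return _mm_and_si128(v, m);                                // one AND covers all four bitsets
}
```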
At this point, I have two questions:
Is converting the bitsets to unsigned long integers with to_ulong() a good way to do this? I suspect the conversion overhead may cancel out any performance gain I get from SSE.
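For a 32-bit bitset, to_ulong() is usually cheap; the larger cost with _mm_set_* tends to be gathering four separate scalars into one register, which takes several instructions for every group of four. One way to avoid both is an assumption about your data layout: keep the bitsets as plain uint32_t words in a contiguous array, so four of them can be loaded with a single _mm_loadu_si128. The function and_in_place below is a hypothetical sketch of that approach, with a scalar loop for any leftover elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_storeu_si128, _mm_and_si128

// Hypothetical helper, assuming the bitsets live in a contiguous uint32_t array
// instead of std::bitset<32> objects. ANDs sets[0..n) with mask in place.
void and_in_place(std::uint32_t* sets, std::size_t n, std::uint32_t mask) {
    const __m128i m = _mm_set1_epi32(static_cast<int>(mask));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // four bitsets per iteration
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(sets + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(sets + i), _mm_and_si128(v, m));
    }
    for (; i < n; ++i) sets[i] &= mask;  // scalar tail when n is not a multiple of 4
}
```

Whether this beats the scalar loop depends on the surrounding code; a plain loop over uint32_t may already be auto-vectorized by the compiler, so measuring both is worthwhile.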
What is the best way to store v3 back to memory as four bitsets? I am planning to use the _mm_storeu_ps intrinsic.
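If the vector holds integer lanes, the counterpart of _mm_storeu_ps is _mm_storeu_si128, which writes 16 bytes to any (possibly unaligned) address. A minimal sketch, assuming SSE2 and that std::bitset<32> objects are the desired output: store into a small uint32_t buffer, then construct each bitset from its word. The helper name to_bitsets is hypothetical. A register built with _mm_set_ps/_mm_and_ps can be passed through _mm_castps_si128 first, though its lanes will still hold float bit patterns rather than the original bitsets.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_storeu_si128

// Hypothetical helper: writes the four 32-bit lanes of v out as four std::bitset<32>.
// Lane 0 (the last argument of _mm_set_epi32) becomes element 0 of the result.
inline std::array<std::bitset<32>, 4> to_bitsets(__m128i v) {
    std::uint32_t words[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(words), v);  // unaligned store, no alignas needed
    return {std::bitset<32>(words[0]), std::bitset<32>(words[1]),
            std::bitset<32>(words[2]), std::bitset<32>(words[3])};
}
```

If the bitsets are already kept as a uint32_t array, storing straight into that array skips the extra copy.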