I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.
Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.
Basically, I am trying to vectorize the following vanilla C++ code:
//outside loop
const float LUT_RATIO = 1000.0F;
//in loop
float v = ... //input value
v = myLookupTable[static_cast<int>(v * LUT_RATIO)];
What I'm trying:
//outside loop
const __m128 LUT_RATIO = _mm_set1_ps(1000.0F);
//in loop
__m128 v = _mm_set_ps(v1, v2, v3, v4); //input values
__m128i vI = _mm_cvtps_epi32(_mm_mul_ps(v, LUT_RATIO)); //multiply and convert to integers
v = ??? // how to get vI indices of myLookupTable?
edit: ildjarn makes a point that demands clarification on my part. I am not trying to achieve speedup for the lookup table code, I am simply trying to avoid having to store the registers back to floats specifically for doing the lookup, as this part is sandwiched between 2 other parts which could theoretically benefit from SSE.