
I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.

Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.

Basically, I am trying to vectorize the following vanilla C++ code:

// outside loop
const float LUT_RATIO = 1000.0F;

// in loop
float v = ... // input value
v = myLookupTable[static_cast<int>(v * LUT_RATIO)];

What I'm trying:

// outside loop
const __m128 LUT_RATIO = _mm_set1_ps(1000.0F);

// in loop
__m128 v = _mm_set_ps(v1, v2, v3, v4); // input values (note: _mm_set_ps places v4 in the lowest lane)
__m128i vI = _mm_cvttps_epi32(_mm_mul_ps(v, LUT_RATIO)); // multiply and truncate to integers, matching static_cast<int>
v = ??? // how to get vI indices of myLookupTable?

Edit: ildjarn makes a point that demands clarification on my part. I am not trying to speed up the lookup itself. I am simply trying to avoid storing the registers back to floats just to do the lookup, because this part is sandwiched between two other parts that could theoretically benefit from SSE.

Rotem
  • Who has you convinced that you can improve on `myLookupTable[static_cast<int>(v * LUT_RATIO)]`? There's no computation being performed here, why would SSE be applicable? – ildjarn May 14 '12 at 20:55
  • @ildjarn I am pretty sure I can't improve this part per se, but I am hoping to improve other parts of the function, and to avoid the penalty of moving back and forth between `__m128` and `float[4]` I must also vectorize this code. – Rotem May 14 '12 at 20:56

1 Answer


If you can wait until next year then Intel's Haswell CPUs will have AVX2 which includes instructions for gathered loads. This enables you to do e.g. 8 parallel LUT lookups in one instruction (see e.g. VGATHERDPS). Other than that, you're out of luck, unless your LUTs are quite small (e.g. 16 elements), in which case you can use PSHUFB.
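
For illustration, here is a minimal sketch of what the gathered-load version could look like. It assumes the compiler defines `__AVX2__` when AVX2 is enabled (for example with `-mavx2`) and falls back to the scalar loop otherwise. The function names `lutLookup` and `lutLookupScalar` are hypothetical, and every computed index is assumed to lie inside the table.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar reference: the same lookup as the loop body in the question.
inline void lutLookupScalar(const float* lut, const float* in, float* out,
                            std::size_t n, float ratio) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = lut[static_cast<int>(in[i] * ratio)];
}

// Looks up n values; uses an AVX2 gather for blocks of 8 when available.
// Assumption: every index in[i] * ratio is within the bounds of lut.
inline void lutLookup(const float* lut, const float* in, float* out,
                      std::size_t n, float ratio) {
    assert(lut != nullptr && in != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256 vRatio = _mm256_set1_ps(ratio);
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(in + i);
        // Truncating conversion, matching static_cast<int> in the scalar code.
        const __m256i idx = _mm256_cvttps_epi32(_mm256_mul_ps(v, vRatio));
        // Eight table loads in one instruction; scale 4 = sizeof(float).
        const __m256 r = _mm256_i32gather_ps(lut, idx, 4);
        _mm256_storeu_ps(out + i, r);
    }
#endif
    // Remaining elements (or all of them when AVX2 is not enabled).
    lutLookupScalar(lut, in + i, out + i, n - i, ratio);
}
```

Whether the gather is actually faster than eight scalar loads depends on the CPU generation and on how well the table fits in cache, so it is worth measuring before committing to it.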

Paul R
  • Unfortunately my LUTs are 10000 elements large. Even if I were to wait for a new processor, it would be years until it would be legitimate to specify Haswell as a minimum cpu. :) Thanks for the info. – Rotem May 14 '12 at 21:06
  • OK - if you can approximate your LUTs, e.g. with a polynomial, then you may still get a win with SSE; otherwise I'm afraid you're stuck with scalar code. – Paul R May 14 '12 at 21:16
  • Scalar code it is then. This is good news in a way, I can stop worrying about this part and go work on parts that might prove more optimizable. – Rotem May 14 '12 at 21:20
  • Why wouldn't something like this work: `_mm_storeu_si128((__m128i*) LutIndex, _mm_cvtps_epi32(_mm_mul_ps(LUT_RATIO, floatData))); __m128 www = _mm_set_ps(myLUT[LutIndex[3]], myLUT[LutIndex[2]], myLUT[LutIndex[1]], myLUT[LutIndex[0]]);` – Royi Aug 01 '16 at 06:51
  • @Drazick: it would work, but there is a lot of scalar code and multiple memory accesses hidden behind that `_mm_set_ps` intrinsic. – Paul R Aug 01 '16 at 08:57
  • @PaulR, is there a better way to do it using SSE? How would you do it with AVX? – Royi Aug 01 '16 at 11:28
  • Not really - you need AVX2 or AVX-512 for gathered loads. – Paul R Aug 01 '16 at 11:42
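
The store-and-reload approach discussed in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the function name `lutLookup4` and its parameters are hypothetical, the SSE2 path is compiled only when the compiler defines `__SSE2__`, and the conversion truncates so that it matches `static_cast<int>` in the scalar code from the question.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Looks up four values at once. The multiply and float-to-int conversion are
// vectorized; the four table loads are still scalar, as Paul R points out.
// Assumption: every index in4[i] * ratio lies in [0, lutSize).
inline void lutLookup4(const float* lut, int lutSize,
                       const float* in4, float* out4, float ratio) {
#if defined(__SSE2__)
    const __m128 v = _mm_loadu_ps(in4);
    // Truncating conversion, matching static_cast<int>.
    const __m128i vI = _mm_cvttps_epi32(_mm_mul_ps(v, _mm_set1_ps(ratio)));
    // Spill the indices to memory so they can be used as array subscripts.
    alignas(16) std::int32_t idx[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(idx), vI);
    for (int i = 0; i < 4; ++i)
        assert(idx[i] >= 0 && idx[i] < lutSize);
    // _mm_setr_ps takes its arguments in memory order (lane 0 first).
    const __m128 r = _mm_setr_ps(lut[idx[0]], lut[idx[1]],
                                 lut[idx[2]], lut[idx[3]]);
    _mm_storeu_ps(out4, r);
#else
    // Portable fallback: the plain scalar lookup.
    for (int i = 0; i < 4; ++i) {
        const int idx = static_cast<int>(in4[i] * ratio);
        assert(idx >= 0 && idx < lutSize);
        out4[i] = lut[idx];
    }
#endif
}
```

In a real pixel loop the result would presumably stay in the `__m128` register `r` and feed the next vectorized stage directly; it is written back to `out4` here only to keep the sketch easy to test.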