I have code of the following form:
short v[8] __attribute__ (( aligned(16)));
...
// in an inlined function :
_mm_store_si128((__m128i *)v, some_m128i_value); // cast needed: the intrinsic takes a __m128i *
... // some more operation (4 additions )
outp[0] = v[1] / 2; // <- first access of v since the previous store
When I annotate this code with perf, this single line accounts for 18% of all samples! By "line" I mean at the assembly level, i.e. the instruction immediately after the load from v accounts for 18%.
Is it a cache miss? How can I test that?
I don't really need to store the result, but how can I avoid a round trip to memory and still access the 8 shorts that make up my __m128i value individually?
Update: If I use _mm_extract_epi16, the overall performance is no better, but the waiting is spread evenly across the accesses instead of hitting just the first one.
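For reference, here is a minimal sketch of the two access patterns I am comparing, reduced to a single lane. The function names are made up for illustration and are not from my real code; it assumes GCC or Clang on x86 with SSE2 enabled.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_store_si128, _mm_extract_epi16 */

/* Variant 1 (hypothetical name): spill the vector to an aligned array,
 * then read one element back from memory. */
static inline short lane1_half_via_store(__m128i x)
{
    short v[8] __attribute__((aligned(16)));
    _mm_store_si128((__m128i *)v, x);
    return v[1] / 2;  /* first access of v since the store */
}

/* Variant 2 (hypothetical name): pull the lane straight out of the register. */
static inline short lane1_half_via_extract(__m128i x)
{
    /* _mm_extract_epi16 zero-extends the 16-bit lane into an int,
     * so cast back to short to recover the sign before dividing. */
    return (short)_mm_extract_epi16(x, 1) / 2;
}
```

Both variants should return the same value for any input; the lane index passed to _mm_extract_epi16 has to be a compile-time constant.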