
I have the following type of code:

short v[8] __attribute__((aligned(16)));
...
// in an inlined function:
_mm_store_si128((__m128i *)v, some_m128i_value);
... // some more operations (4 additions)
outp[0] = v[1] / 2; // <- first access of v since the previous store

When I annotate this code with perf, this single line accounts for 18% of the whole sampling! By "line" I mean at the assembly level, i.e. the instruction immediately after the move from v accounts for 18%.

Is it a cache miss? How can I test that?
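One way to answer this from the command line (event names vary by CPU model, and ./your_program stands in for the actual binary) is to compare the failed-store-forwarding counter against the cache-miss counter:

```shell
# Hypothetical sketch: on Intel CPUs, loads blocked by a failed
# store-to-load forward are counted separately from cache misses,
# so the two hypotheses can be told apart. Check `perf list` for
# the exact event name on your machine.
perf stat -e ld_blocks.store_forward,cache-misses ./your_program
```

A high ld_blocks.store_forward count with few cache misses points at a store-forwarding stall rather than a cache problem.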

I don't really need to store the result, but how can I avoid a round trip through memory and still individually access the 8 shorts composing my __m128i value?

Update: If I use _mm_extract_epi16, the overall performance is no better, but the waiting is divided equally between each access instead of hitting just the first one.

shodanex
  • You have a store-forwarding problem. You can find the details in the Intel Optimization Manual. In short, what you could try is to load the first dword of v and shift it to get the second word. – Marat Dukhan Dec 11 '11 at 19:46
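A minimal sketch of what this comment suggests (the helper name is mine): instead of a 16-bit load of v[1] right after the 128-bit store, do one 32-bit load covering v[0] and v[1], then shift right by 16 to isolate the second word. Little-endian layout is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Load one dword spanning v[0] and v[1], then shift to get v[1].
   memcpy compiles to a plain 32-bit load at any optimization level. */
static inline int16_t second_word(const int16_t v[8])
{
    uint32_t dword;
    memcpy(&dword, v, sizeof dword); /* one 32-bit load of v[0..1] */
    return (int16_t)(dword >> 16);   /* high half of the dword is v[1] */
}
```

Note this still reloads from memory; it only changes the size of the reload, so whether it helps depends on the CPU's store-forwarding rules.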

1 Answer


Instead of doing a SIMD store followed by scalar loads, you should use _mm_extract_epi16 (PEXTRW) to get 16-bit scalar values directly from your 128-bit SSE register without going via memory, e.g.

outp[0] = _mm_extract_epi16(some_m128i_value, 6);
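Applied to the question's exact line (outp[0] = v[1] / 2), this might look like the following sketch, with a hypothetical helper name:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Word 1 is pulled straight from the XMM register, so the v[] buffer
   is never written or read and no store-forwarding stall can occur. */
static inline int16_t word1_halved(__m128i x)
{
    /* PEXTRW zero-extends into a 32-bit int, so cast back to
       int16_t before dividing to keep signed semantics */
    int16_t w = (int16_t)_mm_extract_epi16(x, 1);
    return (int16_t)(w / 2);
}
```

Note that the index argument of _mm_extract_epi16 must be a compile-time constant, so extracting all 8 shorts means 8 separate intrinsic calls rather than a loop over a variable index.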
Paul R
  • And of course I was asleep 1 hour ago... +1. I want to add this: you usually want to avoid immediately accessing the same memory with a different word size. Most processors' load/store units aren't optimized to handle these situations and will end up flushing everything to cache and reading it back again – often leading to a 10+ cycle penalty. – Mysticial Dec 08 '11 at 19:18
  • Wow! That's a very interesting comment! I really did not understand why the v array would somehow get flushed. – shodanex Dec 08 '11 at 19:48