
I need to implement a vpgatherdd-like mechanism without AVX2.

Say I have four i32 offsets packed in xmm0. I need to extract each element of xmm0 into a general-purpose register so that I can do the mov reg, [base + offset] for it.

The question is: how should I extract the elements?

There is pextrd, whose latency is 3. And I'm not doing this in a streaming fashion: the offsets are computed on the fly, 4 or 8 at a time.

There is also psrldq, whose latency is 1. Combined with one movd, it seems I can extract an element in 2 cycles.
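To make the two options concrete, here is a minimal sketch in C intrinsics, assuming byte offsets and 32-bit elements. The names `load_i32`, `gather4_shift`, and `gather4_pextrd` are hypothetical, made up for this illustration. The compiler is free to pick different instructions than the intrinsics suggest, so the generated assembly should be checked before drawing conclusions about latency.

```c
#include <emmintrin.h>  /* SSE2: _mm_srli_si128 (psrldq), _mm_cvtsi128_si32 (movd) */
#include <stdint.h>
#include <string.h>
#ifdef __SSE4_1__
#include <smmintrin.h>  /* SSE4.1: _mm_extract_epi32 (pextrd) */
#endif

/* Hypothetical helper: the scalar "mov reg, [base + offset]" step. */
static inline int32_t load_i32(const void *base, int32_t byte_off) {
    int32_t v;
    memcpy(&v, (const char *)base + byte_off, sizeof v);
    return v;
}

/* SSE2 variant: byte-shift the offset vector, then move lane 0 out.
   Each shift starts from the original vector, so the four extractions
   are independent of one another. */
static inline __m128i gather4_shift(const void *base, __m128i off) {
    int32_t o0 = _mm_cvtsi128_si32(off);
    int32_t o1 = _mm_cvtsi128_si32(_mm_srli_si128(off, 4));
    int32_t o2 = _mm_cvtsi128_si32(_mm_srli_si128(off, 8));
    int32_t o3 = _mm_cvtsi128_si32(_mm_srli_si128(off, 12));
    return _mm_set_epi32(load_i32(base, o3), load_i32(base, o2),
                         load_i32(base, o1), load_i32(base, o0));
}

#ifdef __SSE4_1__
/* SSE4.1 variant: extract lanes 1..3 directly with pextrd.
   Only compiled when the build enables SSE4.1 (e.g. -msse4.1). */
static inline __m128i gather4_pextrd(const void *base, __m128i off) {
    int32_t o0 = _mm_cvtsi128_si32(off);
    int32_t o1 = _mm_extract_epi32(off, 1);
    int32_t o2 = _mm_extract_epi32(off, 2);
    int32_t o3 = _mm_extract_epi32(off, 3);
    return _mm_set_epi32(load_i32(base, o3), load_i32(base, o2),
                         load_i32(base, o1), load_i32(base, o0));
}
#endif
```

In both variants the four extractions do not depend on each other, which is what gives the CPU a chance to overlap them.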

One cycle is not much, but I wonder which is better here. And could this latency be hidden by the extraction of the next 8 offsets?

Is there a general rule for choosing between one long-latency instruction and a group of short-latency instructions?

BlueWanderer
  • Why do you care about latency, not throughput? – Marat Dukhan Jun 21 '13 at 08:01
  • I suggest you check out the `lookup` function in Agner Fog's vectorclass. This is a gather function that he has implemented that works for SSE2-AVX2. He claims the speed is medium for non-AVX2 and fast for AVX2. Check out the source code in e.g. vectorf128.h and read pages 44-45 of his manual VectorClass.pdf – Jun 21 '13 at 09:14
  • @raxman His AVX implementation will result in a read-forward violation stall, in addition to fetching data from cache, if the compiler doesn't eliminate the memory operation (and it seems VC++ and G++ both don't). – BlueWanderer Jun 21 '13 at 10:16
  • @MaratDukhan That's why I wonder whether the latency could be hidden. I'm not quite sure how throughput works. – BlueWanderer Jun 21 '13 at 10:17
  • @BlueWanderer, yes I think you're right, it's inefficient due to a store forwarding stall. It works well for small values of n (array size to gather from) but not for large values of n. I don't have a better suggestion. –  Jun 21 '13 at 10:33
  • @BlueWanderer, the latency can be hidden, and in most cases you should optimize for throughput to get high overall performance – Marat Dukhan Jun 21 '13 at 11:02
