pextrd vs psrldq+movd vs others: which is better for extracting one element from an XMM register?

Question

I need to implement a vpgatherdd-like mechanism without AVX2.

Say I have four i32 offsets packed in xmm0. I need to extract each element of xmm0 in order to do the mov reg, [base + offset] job.

The problem is: how should I extract the elements?

There is pextrd, whose latency is 3. I'm also not processing this in a stream-like fashion: the offsets are calculated on the fly, 4 or 8 at a time.

And there is psrldq, whose latency is 1; combined with one movd, it seems that I can do the extraction in 2 cycles.
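For concreteness, here is a minimal sketch of the psrldq + movd approach written with SSE2 intrinsics in C. The helper names `load_i32` and `gather4_sse2` are hypothetical, and the sketch assumes the offsets are byte offsets, as the mov reg, [base + offset] form suggests. Each shift reads the original register, so the four extractions do not depend on one another.

```c
#include <emmintrin.h>  /* SSE2: _mm_srli_si128 (psrldq), _mm_cvtsi128_si32 (movd) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the "mov reg, [base + offset]" step.
   Assumption: byte_offset is a byte offset, not an element index. */
static inline int32_t load_i32(const void *base, int32_t byte_offset) {
    int32_t v;
    memcpy(&v, (const char *)base + byte_offset, sizeof v);
    return v;
}

/* Hypothetical helper: emulate a 4-element gather using SSE2 only. */
static inline __m128i gather4_sse2(const void *base, __m128i offsets) {
    /* Element 0 needs only movd. */
    int32_t o0 = _mm_cvtsi128_si32(offsets);
    /* Elements 1..3: shift the original register right by 4, 8, 12 bytes
       (psrldq), then movd. The shifts are independent of each other. */
    int32_t o1 = _mm_cvtsi128_si32(_mm_srli_si128(offsets, 4));
    int32_t o2 = _mm_cvtsi128_si32(_mm_srli_si128(offsets, 8));
    int32_t o3 = _mm_cvtsi128_si32(_mm_srli_si128(offsets, 12));

    /* _mm_set_epi32 takes its arguments from the highest element to the lowest. */
    return _mm_set_epi32(load_i32(base, o3), load_i32(base, o2),
                         load_i32(base, o1), load_i32(base, o0));
}
```

With SSE4.1 available, each shift-and-movd pair could instead be a single `_mm_extract_epi32(offsets, i)`, which corresponds to pextrd.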

Although 1 cycle is not much, I wonder which is better here. Could this latency be hidden by the extraction of the next 8 offsets?

Is there a general rule for choosing between a single long-latency instruction and a group of short-latency instructions?

I suggest you check out the `lookup` function in Agner Fog's vectorclass. This is a gather function that he has implemented that works for SSE2-AVX2. He claims the speed is medium for non-AVX2 and fast for AVX2. Check out the source code in e.g. vectorf128.h and read pages 44-45 of his manual VectorClass.pdf — , Jun 21 '13 at 09:14
@raxman His AVX implementation will result in a read-forward violation stall, in addition to fetching data from cache, if the compiler doesn't eliminate the memory operation (and it seems that neither VC++ nor G++ does). — BlueWanderer, Jun 21 '13 at 10:16
@MaratDukhan that's why I wonder if the latency could be hidden; I'm not quite sure how the throughput works. — BlueWanderer, Jun 21 '13 at 10:17
@BlueWanderer, yes I think you're right, it's inefficient due to a store forwarding stall. It works well for small values of n (array size to gather from) but not for large values of n. I don't have a better suggestion. — , Jun 21 '13 at 10:33
@BlueWanderer, the latency can be hidden, and in most cases you should optimize for throughput to get high overall performance — Marat Dukhan, Jun 21 '13 at 11:02
