I need to implement a vpgatherdd-like mechanism without AVX2.
Say I have four i32 offsets packed in xmm0. I need to extract each element of xmm0 in order to do the mov reg, [base + offset] job.
The problem is: how should I extract the elements?
There is pextrd, whose latency is 3. And I'm not doing this in a streaming fashion: the offsets are calculated on the fly, 4 or 8 at a time.
There is also psrldq, whose latency is 1; combined with one movd, it seems I can do the extraction in 2 cycles.
Though 1 cycle is not much, I wonder which is better here. And could this latency be hidden by the extraction of the next 8 offsets?
Is there a general rule for choosing between one long-latency instruction and a group of short-latency instructions?
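
For concreteness, here is a rough sketch of the psrldq + movd variant, written with C intrinsics rather than raw assembly. The names gather4_sse2 and load_i32 are made up for illustration, and I'm assuming the offsets are byte offsets from base. The compiler may well pick different instructions than the ones named in the comments, so treat this as a description of the intent, not of the final code.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi128_si32 (movd), _mm_srli_si128 (psrldq) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load one i32 from base + byte_off (assumed byte offset). */
static inline int32_t load_i32(const void *base, int32_t byte_off)
{
    int32_t v;
    memcpy(&v, (const char *)base + byte_off, sizeof v);
    return v;
}

/* Hypothetical SSE2-only gather of four i32 values.
 * Every shift starts from the original register, so the four extractions
 * do not depend on each other and can overlap in the pipeline. */
static inline __m128i gather4_sse2(const void *base, __m128i offsets)
{
    int32_t o0 = _mm_cvtsi128_si32(offsets);                      /* movd          */
    int32_t o1 = _mm_cvtsi128_si32(_mm_srli_si128(offsets, 4));   /* psrldq + movd */
    int32_t o2 = _mm_cvtsi128_si32(_mm_srli_si128(offsets, 8));   /* psrldq + movd */
    int32_t o3 = _mm_cvtsi128_si32(_mm_srli_si128(offsets, 12));  /* psrldq + movd */

    /* _mm_set_epi32 takes its arguments from the highest lane to the lowest. */
    return _mm_set_epi32(load_i32(base, o3), load_i32(base, o2),
                         load_i32(base, o1), load_i32(base, o0));
}
```

With SSE4.1 available, each shifted extraction could instead be a single _mm_extract_epi32 (pextrd), which is the alternative I'm comparing against.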