
I'm taking my first steps with SIMD and I was wondering what the right approach to the following problem is. Consider two vectors:

+---+---+---+---+    +---+---+---+---+
| 0 | 1 | 2 | 3 |    | 4 | 5 | 6 | 7 |
+---+---+---+---+    +---+---+---+---+

How to "interleave" the elements of those vectors so that they become:

+---+---+---+---+    +---+---+---+---+
| 0 | 4 | 1 | 5 |    | 2 | 6 | 3 | 7 |
+---+---+---+---+    +---+---+---+---+

I was surprised I could not find an instruction for doing this, given the great many kinds of shuffles, broadcasts, permutes, and so on. It could probably be done with some combination of unpacklo, unpackhi, and the like, but I was wondering if there is a canonical way of doing it, as it seems to be quite a common problem (SoA vs. AoS). For simplicity, let's assume AVX(2) and vectors of four floats.
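A minimal sketch of the four-float case, using the unpcklps/unpckhps intrinsics suggested in the comments below (the tiny test program and its printout are just an illustration, not reference code):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* a = {0,1,2,3}, b = {4,5,6,7}, element 0 in the lowest position */
    __m128 a = _mm_setr_ps(0.f, 1.f, 2.f, 3.f);
    __m128 b = _mm_setr_ps(4.f, 5.f, 6.f, 7.f);

    __m128 lo = _mm_unpacklo_ps(a, b);   /* {0, 4, 1, 5} */
    __m128 hi = _mm_unpackhi_ps(a, b);   /* {2, 6, 3, 7} */

    float out[8];
    _mm_storeu_ps(out,     lo);
    _mm_storeu_ps(out + 4, hi);
    for (int i = 0; i < 8; ++i) printf("%g ", out[i]);   /* 0 4 1 5 2 6 3 7 */
    printf("\n");
    return 0;
}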

Edit:

Floats vs. doubles

The comment below (correctly) suggests I should use unpcklps and unpckhps for floats. Which instructions should I use to unpack a vector of four doubles? I'm asking because of how _mm256_unpacklo_pd/_mm256_unpackhi_pd are documented; here is _mm256_unpackhi_pd:

Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

DEFINE INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]) {
    dst[63:0] := src1[127:64] 
    dst[127:64] := src2[127:64] 
    RETURN dst[127:0]   
}
dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])
dst[255:128] := INTERLEAVE_HIGH_QWORDS(a[255:128], b[255:128])
dst[MAX:256] := 0

So what the two of them apparently do is (unpacklo on the left, unpackhi on the right):

+---+---+---+---+    +---+---+---+---+
| 0 | 4 | 2 | 6 |    | 1 | 5 | 3 | 7 |
+---+---+---+---+    +---+---+---+---+
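A minimal sketch that demonstrates this in-lane behaviour for a = {0,1,2,3} and b = {4,5,6,7} (my own test program, assuming AVX, e.g. compiled with -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256d a = _mm256_setr_pd(0.0, 1.0, 2.0, 3.0);
    __m256d b = _mm256_setr_pd(4.0, 5.0, 6.0, 7.0);

    __m256d lo = _mm256_unpacklo_pd(a, b);   /* in-lane: {0, 4, 2, 6} */
    __m256d hi = _mm256_unpackhi_pd(a, b);   /* in-lane: {1, 5, 3, 7} */

    double out[8];
    _mm256_storeu_pd(out,     lo);
    _mm256_storeu_pd(out + 4, hi);
    for (int i = 0; i < 8; ++i) printf("%g ", out[i]);   /* 0 4 2 6 1 5 3 7 */
    printf("\n");
    return 0;
}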
Ecir Hana
  • The `unpcklps` and `unpckhps` instructions do exactly what you want. See also my article about the subject: http://const.me/articles/simd/simd.pdf (float shuffles are on page 11, integer shuffles on page 16). – Soonts Mar 16 '21 at 23:09
  • If I wanted to unpack doubles, instead of floats, which instruction should I use? Apparently `_mm256_unpacklo_pd` "interleave[s] ... elements from the low half of each 128-bit lane". – Ecir Hana Mar 17 '21 at 07:32
  • Right, 256-bit AVX shuffles are "in-lane", like two separate 128-bit `unpcklpd` operations in each lane of the __m256[i/d] vector. So you'd probably want `vinsertf128` (for the low halves) or `vperm2f128` (for the high halves) to get the data you want into a single vector, plus `vpermpd` to reorder it within that vector. It's not until AVX-512 that Intel introduced proper 2-vector lane-crossing shuffles like `vpermt2pd` that avoid this level of suckage. However, sometimes instead of 2 shuffles per vector output, you can do some shuffle and some blend, improving throughput. – Peter Cordes Mar 17 '21 at 07:53 (a sketch of this sequence follows the comments)
  • @PeterCordes I know I'm just at the beginning of this whole SIMD thing, so I don't understand most of the stuff, but "this level of suckage" occurred to me more than once. I see that those instructions you mention have higher latency than `unpcklps`/`unpckhps`; do you recommend any particular sequence of instructions so that it does not hurt too much (to emulate `vpermt2pd` or similar)? – Ecir Hana Mar 17 '21 at 08:04
  • Usually throughput matters more than latency, thanks to out-of-order exec being able to overlap independent iterations. But Intel CPUs (from Haswell until Ice Lake) only have 1/clock shuffle throughput, so it's usually just the total number of shuffles you want to minimize. Unless there's something clever that allows one or two vblendpd instead of one of the 4 shuffles, probably just what I suggested. – Peter Cordes Mar 17 '21 at 08:17
  • Maybe lane-swap one input with `vperm2f128`, then blend 2 different ways to create `0 1 | 4 5` and `6 7 | 2 3`, then 2x `vpermpd` to shuffle the data into place. That avoids needing any vector constants, which is nice, and costs 2 cheap blends + 3 shuffles, vs. 4 shuffles. So it's worse for front-end throughput, better for back-end port 5 pressure. IDK if there's an existing Q&A showing any more clever trick. – Peter Cordes Mar 17 '21 at 08:19 (also sketched after the comments)
  • Interleaving pairs of 8x float vectors (`__m256`) might use a similar combo of shuffle and blend. – Peter Cordes Mar 17 '21 at 08:21
  • For 256-bit vectors you can also consider using `vextractf128` or `vinsertf128` directly to or from memory (before/after `unpck[lh]p[sd]`), which does not cause any shuffle operation but requires more stores or loads (it really depends on what you want to do overall). – chtz Mar 17 '21 at 09:46
  • For questions like this, generally the best place to start is by [asking the compiler](https://godbolt.org/z/6fsx9Y). `__builtin_shufflevector` (clang) and `__builtin_shuffle` (GCC) tend to be really good at this. – nemequ Mar 17 '21 at 13:02
  • What does a straight C or Fortran loop compile to in terms of the instructions? Whatever that optimises to may provide an interesting comparison. – Holmz Mar 17 '21 at 21:33
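Below is a rough sketch of the four-shuffle sequence Peter Cordes describes above. It is my own illustration of that comment, not reference code; the function name interleave_pd is mine, and _mm256_permute4x64_pd (vpermpd) requires AVX2.

#include <immintrin.h>

/* Interleave a = {a0,a1,a2,a3} and b = {b0,b1,b2,b3} (doubles) into
   lo = {a0,b0,a1,b1} and hi = {a2,b2,a3,b3} with four shuffles. */
static inline void interleave_pd(__m256d a, __m256d b, __m256d *lo, __m256d *hi)
{
    __m256d low_halves  = _mm256_permute2f128_pd(a, b, 0x20);  /* {a0,a1,b0,b1} */
    __m256d high_halves = _mm256_permute2f128_pd(a, b, 0x31);  /* {a2,a3,b2,b3} */
    *lo = _mm256_permute4x64_pd(low_halves,  _MM_SHUFFLE(3, 1, 2, 0));  /* {a0,b0,a1,b1} */
    *hi = _mm256_permute4x64_pd(high_halves, _MM_SHUFFLE(3, 1, 2, 0));  /* {a2,b2,a3,b3} */
}

For the low halves, the first vperm2f128 could instead be _mm256_insertf128_pd(a, _mm256_castpd256_pd128(b), 1), which is the vinsertf128 option mentioned in the comment. The blend variant from the later comment would look roughly like this (again a sketch of my own, assuming AVX2):

/* Lane-swap one input, blend two ways, then fix up each result with vpermpd:
   2 cheap blends + 3 shuffles, no vector constants. */
static inline void interleave_pd_blend(__m256d a, __m256d b, __m256d *lo, __m256d *hi)
{
    __m256d bswap  = _mm256_permute2f128_pd(b, b, 0x01);   /* {b2,b3,b0,b1} */
    __m256d blend0 = _mm256_blend_pd(a, bswap, 0xC);       /* {a0,a1,b0,b1} */
    __m256d blend1 = _mm256_blend_pd(a, bswap, 0x3);       /* {b2,b3,a2,a3} */
    *lo = _mm256_permute4x64_pd(blend0, _MM_SHUFFLE(3, 1, 2, 0));  /* {a0,b0,a1,b1} */
    *hi = _mm256_permute4x64_pd(blend1, _MM_SHUFFLE(1, 3, 0, 2));  /* {a2,b2,a3,b3} */
}

And nemequ's suggestion of letting the compiler pick the shuffles can be expressed with clang's builtin (GCC's __builtin_shuffle takes an index vector instead):

    /* with __m256d a, b in scope (clang only) */
    __m256d lo = __builtin_shufflevector(a, b, 0, 4, 1, 5);
    __m256d hi = __builtin_shufflevector(a, b, 2, 6, 3, 7);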

0 Answers