
I'm trying to optimize an algorithm that will process massive datasets and could strongly benefit from AVX SIMD instructions. Unfortunately, the input memory layout is not optimal for the required computations. The information must be reordered by assembling __m256i values from individual bytes that are exactly 4 bytes apart:

BEGIN EDIT

My target CPUs do not support AVX2 instructions, so, as @Elalfer and @PeterCordes pointed out, I can't make use of __m256i values; the code must be converted to use __m128i values instead.

END EDIT

DataSet layout in memory


Byte 0   | Byte 1   | Byte 2   | Byte 3
Byte 4   | Byte 5   | Byte 6   | Byte 7
...
Byte 120 | Byte 121 | Byte 122 | Byte 123
Byte 124 | Byte 125 | Byte 126 | Byte 127

Desired values in the __m256i variable:


| Byte 0 | Byte 4 | Byte 8 |     ...     | Byte 120 | Byte 124 |

Is there a more efficient way to gather and rearrange the strided data other than this straightforward code?

union {  __m256i   reg;   uint8_t bytes[32]; } aux;
...
for( int i = 0; i < 32; i++ )
    aux.bytes[i] = data[i * 4];

Edit:

The step I'm trying to optimize is a bit column transposition; in other words, the bits of a certain column (32 possible bit columns in my data arrangement) should become a single uint32_t value, while the rest of the bits are ignored.

I perform the transposition by rearranging the data as shown, performing a left shift to bring the desired bit column to the most significant bit of each byte, and finally extracting and assembling the bits into a single uint32_t value via the _mm256_movemask_epi8() intrinsic.
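
A minimal sketch of that straightforward pipeline, assuming AVX2 is available (as it is on the development machine); `data`, `bit_wanted` and the function name are placeholders:

#include <immintrin.h>
#include <stdint.h>

// Gather byte column 0, shift the wanted bit (0..7 within that byte) up to
// the MSB of every byte, then collect the 32 MSBs with movemask.
static uint32_t bit_column(const uint8_t *data, int bit_wanted)
{
    union {  __m256i   reg;   uint8_t bytes[32]; } aux;
    for (int i = 0; i < 32; i++)
        aux.bytes[i] = data[i * 4];              // one byte per 4-byte row

    // dword-granularity shift is fine here: only the high bit of each byte is read
    __m256i v = _mm256_sll_epi32(aux.reg, _mm_cvtsi32_si128(7 - bit_wanted));
    return (uint32_t)_mm256_movemask_epi8(v);    // one bit per row
}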

BlueStrat

3 Answers


One way would be to pack the bytes with _mm256_shuffle_epi8, blend the resulting vectors with _mm256_blend_epi32 (you'll need to do 4 such load+shuffle steps), and then do a single 32-bit permute with _mm256_permutevar8x32_epi32.

Here is pseudo code (I hope you can come up with the shuffle masks; one possible set is sketched after the updates below):

L1 = load32byte(buf)
L2 = load32byte(buf+32)
L3 = load32byte(buf+64)
L4 = load32byte(buf+96)

// Pack 4 bytes in the corresponding 32bit DWORD in each lane and zero-out other bytes
L1 = shuffle(L1, mask_for_L1)   
L2 = shuffle(L2, mask_for_L2)
L3 = shuffle(L3, mask_for_L3)
L4 = shuffle(L4, mask_for_L4)

// Vec = blend(blend(L1,L2),blend(L3,L4))
Vec = or(or(or(L1,L2),L3),L4)
Vec = permute(Vec)  // fix DWORD order in the vector

Update: I forgot to give the reason I said "zero-out other bytes": this way you can replace the blends with or.

Update: Reduced the latency by one cycle by rearranging the or operations, per Peter's comment below.
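
For concreteness, here is a sketch with one possible set of shuffle masks and the final permute index (the function name is made up; the -1 mask entries have the high bit set, so vpshufb zeroes those bytes):

#include <immintrin.h>

static __m256i gather_stride4(const uint8_t *buf)
{
    __m256i l1 = _mm256_loadu_si256((const __m256i *)(buf +  0));
    __m256i l2 = _mm256_loadu_si256((const __m256i *)(buf + 32));
    __m256i l3 = _mm256_loadu_si256((const __m256i *)(buf + 64));
    __m256i l4 = _mm256_loadu_si256((const __m256i *)(buf + 96));

    // L1's packed bytes go to dword slot 0 of each 128b lane, L2's to slot 1, etc.
    const __m256i m1 = _mm256_setr_epi8( 0, 4, 8,12,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
                                         0, 4, 8,12,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1);
    const __m256i m2 = _mm256_setr_epi8(-1,-1,-1,-1, 0, 4, 8,12,-1,-1,-1,-1,-1,-1,-1,-1,
                                        -1,-1,-1,-1, 0, 4, 8,12,-1,-1,-1,-1,-1,-1,-1,-1);
    const __m256i m3 = _mm256_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1, 0, 4, 8,12,-1,-1,-1,-1,
                                        -1,-1,-1,-1,-1,-1,-1,-1, 0, 4, 8,12,-1,-1,-1,-1);
    const __m256i m4 = _mm256_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, 0, 4, 8,12,
                                        -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, 0, 4, 8,12);

    __m256i v = _mm256_or_si256(
                    _mm256_or_si256(_mm256_shuffle_epi8(l1, m1),
                                    _mm256_shuffle_epi8(l2, m2)),
                    _mm256_or_si256(_mm256_shuffle_epi8(l3, m3),
                                    _mm256_shuffle_epi8(l4, m4)));

    // dwords are now: b0-12, b32-44, b64-76, b96-108 | b16-28, b48-60, b80-92, b112-124
    return _mm256_permutevar8x32_epi32(v, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));
}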

PS. I'd also recommend taking a look at the BMI instruction set, since you're doing bit manipulation.

Elalfer
  • Nice solution! Unfortunately I can't use BMI instructions, as the servers that will run this code don't have CPUs with BMI support. Thanks! – BlueStrat Aug 17 '15 at 22:30
  • BMI is supposed to be supported on all platforms with AVX2 support, for both Intel & AMD. – Elalfer Aug 18 '15 at 00:33
  • You're totally right. I stated AVX2 instructions, but in reality my target CPUs only support AVX. Still, I could apply your technique. Thanks again! – BlueStrat Aug 18 '15 at 16:07
  • @BlueStrat and Elalfer: if you want to do a 128b version of this, you could use `punpcklqdq` to combine registers (`dest[64:127] = src2[0:63]`). This would cut down on needing different shuffle masks. Also possibly interesting is AND-masking the bytes you don't want to zero, then using `packusdw` to squish dwords from 2 concatenated regs down to words in 1 reg. Neither of these is great, and I think using `POR` for the merging is the best bet. pack/punpck won't help at all for the 256b case, because it does two separate 128b lane operations. – Peter Cordes Aug 18 '15 at 20:17
  • Actually, in the 256b case, you could just slap a permute at the end, like this Elalfer's solution. `pshufb` has the same in-lane behaviour as `pack/punpck`, which is what the final permute fixes. You'd have 4 ANDs with the same mask, then two `packusdw`, then one `packuswb`, then a permute. I'll post an answer with this. – Peter Cordes Aug 18 '15 at 20:46
  • @PeterCordes you are right, one might use `unpack` instructions in order to merge the regs. There are a couple of reasons why I used `pshufb` & `por` - OOO will load masks way in advance so it is not a problem, there are more execution units for `por` than for `pack/unpack` and `por` will be executed in parallel with `pshufb` (at least for Haswell/Broadwell microarchitectures). – Elalfer Aug 19 '15 at 16:42
  • @Elalfer: See my answer for code. I do 4 `pand` (any port) and 3 `pack` insns (shuffle port). You do 4 `pshufb` (shuffle port), and 3 `por`. So in my case, the dependent operations (`pack`) are also the ones with more limited throughput. Also, I agree it's a minor advantage to save on masks, but with mine you could even generate the mask with 2 instructions before a loop instead of loading it, thus not touching any data cache lines. (`pcmpeq xmm5, xmm5 / psrld xmm5, 24`, which takes prob. only 8 insn bytes) – Peter Cordes Aug 19 '15 at 16:50
  • @Elalfer: In your version, the dep chain would be shorter if you did `L1=or( L4, or(L3, or(L1,L2)))`. Then instead of 2 cycles of dependency after the 4th `pshufb`, you only have one. cycle1, 2: pshufb only; cycle3: pshufb+por; cycle4: pshufb+por; cycle5: por. (Different on SnB/IvB and Nehalem, which have 2 shuffle ports. Core2 has a slow pshufb.) – Peter Cordes Aug 19 '15 at 16:55

You can try unrolling that loop; this should at least get rid of one comparison (i < 32), one increment (i++), and one multiplication (i * 4) in the loop body. Constant array offsets might also be slightly faster than variable ones. But note that your compiler may generate similar (or better) code anyway, with the appropriate compilation options enabled.

union {  __m256i   reg;   uint8_t bytes[32]; } aux;
...
aux.bytes[0] = data[0];
aux.bytes[1] = data[4];
...
aux.bytes[31] = data[124];
davlet
  • Thanks for the tip; indeed the compiler unrolled the loop (though not completely). The ugly part was that a lot of redundant loads and stores were performed when filling the union byte-wise. I ended up applying @Elalfer's solution to get rid of this problem. – BlueStrat Aug 17 '15 at 22:33
  • @BlueStrat and davlet: the other weakness of this solution is that a store->load forwarding stall is guaranteed, because Intel and AMD CPUs both fail to forward multiple smaller stores to a wider load. So there's an extra ~10 cycle latency penalty after all the byte-by-byte writes. – Peter Cordes Aug 18 '15 at 20:47

I only just noticed the edit, which has a special-case answer.

If you need to do many different bit positions on the same data, then your current plan is good.

If you only need one bit position (esp. the highest bit position) from 128B of memory, you could use _mm256_movemask_ps to get the high bit from each 32b element. Then combine four 8bit masks in GP registers.
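
In C, that could look roughly like this (a sketch, high-bit case only; the function name and `buf` are placeholders):

#include <immintrin.h>
#include <stdint.h>

static inline uint32_t high_bit_column(const uint8_t *buf)
{
    __m256 v0 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)(buf +  0)));
    __m256 v1 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)(buf + 32)));
    __m256 v2 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)(buf + 64)));
    __m256 v3 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)(buf + 96)));

    uint32_t m0 = (uint32_t)_mm256_movemask_ps(v0);   // 8 sign bits per 256b vector
    uint32_t m1 = (uint32_t)_mm256_movemask_ps(v1);
    uint32_t m2 = (uint32_t)_mm256_movemask_ps(v2);
    uint32_t m3 = (uint32_t)_mm256_movemask_ps(v3);

    return m0 | (m1 << 8) | (m2 << 16) | (m3 << 24);  // combine in GP registers
}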

A good compiler should optimize that to:

vmovdqu   ymm0, [buf + 0]
; to select a different bit:
; vpslld  ymm0, ymm0, count   ; count can be imm8 or the low byte of an xmm register
vmovmskps eax, ymm0

vmovdqu   ymm0, [buf + 32]
vmovmskps ebx, ymm0

...  ecx and edx

mov       ah, bl
mov       ch, dl
shl       ecx, 16
or        eax, ecx

This is nice only if you're testing the high bit (so you don't need to shift each vector before vmovmsk). Even so, this is probably more instructions (and code size) than the other solution.


Answer to the original question:

Similar to Elalfer's idea, but use the shuffle unit for pack instructions instead of pshufb. Also, all the ANDs are independent, so they can execute in parallel. Intel CPUs can do 3 ANDs at once, but only one shuffle. (Or two shuffles at once on pre-Haswell.)

// without AVX2: you won't really be able to
// do anything with a __m256i, only __m128i
// just convert everything to regular _mm_..., and leave out the final permute

mask = _mm256_set1_epi32(0x000000ff);

// same mask for all, and the load can fold into the AND
// You can write the load separately if you like, it'll still fold
L1 = and(mask, (buf))     // load and zero the bytes we don't want
L2 = and(mask, (buf+32))
L3 = and(mask, (buf+64))
L4 = and(mask, (buf+96))

// squish dwords from 2 concatenated regs down to words in 1 reg
pack12 = _mm256_packus_epi32(L1, L2);
pack34 = _mm256_packus_epi32(L3, L4);

packed = _mm256_packus_epi16(pack12, pack34);  // note the different width: zero-padded-16 -> 8

Vec = permute(packed)  // fix DWORD order in the vector (only needed for 256b version)

Vec = shift(Vec, bit_wanted)
bitvec = movemask(Vec)

    // shift:
    //  I guess word or dword granularity is fine, since byte granularity isn't available.
    //  You only care about the high bit, so it doesn't matter that you're not shifting zeroes into the bottom of each byte.

    // _mm_slli_epi32(Vec, imm8): 1 uop, 1c latency if your count is a compile-time constant.
    // _mm_sll_epi32 (Vec, _mm_cvtsi32_si128(count)): 2uop 2c latency if it's variable.

    // *not* _mm_sllv_epi32(): slower: different shift count for each element.

If you're doing this with just AVX (like you said) then you won't have 256b integer instructions available. Just build 128b vectors, and get 16b at a time of mask data. You won't need a final permute at the end.

Merge masks with integer instructions: (m2<<16) | m1. If desired, even go up to 64b of mask data, by combining two 32b masks.
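
As a sketch of that 128b variant (SSE4.1 for the dword pack, compiled with AVX so everything is VEX-coded; the function name and the `bit` parameter are placeholders):

#include <immintrin.h>
#include <stdint.h>

// Handles 64 bytes (16 rows of 4 bytes) per call and returns 16 mask bits.
// `bit` is the wanted position (0..7) within the low byte of each row.
static inline uint32_t bitcol16(const uint8_t *buf, int bit)
{
    const __m128i mask = _mm_set1_epi32(0x000000ff);  // keep only byte 0 of each dword

    __m128i l1 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf +  0)));
    __m128i l2 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf + 16)));
    __m128i l3 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf + 32)));
    __m128i l4 = _mm_and_si128(mask, _mm_loadu_si128((const __m128i *)(buf + 48)));

    // dwords -> words; values are 0..255, so the unsigned saturation never triggers
    __m128i pack12 = _mm_packus_epi32(l1, l2);        // SSE4.1
    __m128i pack34 = _mm_packus_epi32(l3, l4);
    // words -> bytes; with 128b vectors the bytes already come out in memory order
    __m128i packed = _mm_packus_epi16(pack12, pack34);

    // bring the wanted bit to the top of each byte (use _mm_slli_epi32 if the
    // count is a compile-time constant), then grab the sign bits
    __m128i v = _mm_sll_epi32(packed, _mm_cvtsi32_si128(7 - bit));
    return (uint32_t)_mm_movemask_epi8(v);
}

Two calls cover the 128B block, merged exactly as above: mask32 = ((uint32_t)bitcol16(buf + 64, bit) << 16) | bitcol16(buf, bit).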

Performance: This avoids the need for separate load instructions with AVX, since vpand can micro-fuse a memory operand if used with a one-register addressing mode.

  • cycle 1: 3 vpand instructions. (or only 2, if we were waiting on the address, since there's only 2 load ports.)
  • cycle 2: last one or two vpand, one pack (L1, L2)
  • cycle 3: next pack (L3, L4)
  • cycle 4: final pack
  • // 256b AVX2: permute
  • cycle 5: packed shift with imm8 count: 1 uop, 1c latency.
  • cycle 6: movemask (3 cycle latency)

Latency = 8 (SnB and later)

Throughput: 3 shuffles (p5), 4 logicals (p015), 1 shift (p0), 1 pmovmsk (p0). 4 load uops.

  • SnB/IvB: 9 ALU uops -> 3c. 4 memory reads: 2c.
    So depending on what you're doing with the masks, 3 accumulators would be needed to keep the execution ports saturated. (ceil(8/3) = 3.).

With shift count in a variable that can't be resolved to a compile-time constant by compiler inlining / unrolling: latency = 9. And the shift produces another uop for p1/p5.

With AVX2 (Haswell and later), there are 3 extra cycles of latency for the vpermd.

Peter Cordes
  • Thanks @PeterCordes ! You're right! I didn't notice that the _mm256_shuffle_epi8 intrinsic is tagged AVX2. It works on my development machine, but won't work on the target servers (Sandy Bridge). – BlueStrat Aug 18 '15 at 22:55
  • @BlueStrat: Yeah, just work with 128b vectors using VEX-coded instructions (compile with AVX support, and make sure the disassembly shows `vpand`, not `pand`, etc.) All the integer 256b stuff is AVX2-only, except `loadu_si256`. You won't be able to do the bitwise shifts you need with 256b vectors. (3-operand non-destructive operations are great for saving on `mov` instructions, though, which is an even bigger win on SnB, because handling `mov*` instructions at the register-rename stage didn't arrive until IvyBridge.) – Peter Cordes Aug 19 '15 at 01:10
  • Thank you very much again! I am very curious: you exhibit tremendous knowledge of SIMD programming and processor internals. Did you gain it through some special training? – BlueStrat Aug 19 '15 at 15:09
  • @BlueStrat: I trained for many years with a shao-lin monk master computer-fu. No seriously, I just like knowing how things *really* work, and tweaking / optimizing things. I learned everything I know in this area from reading stuff online, with occasional experiments with perf counters. I read http://realworldtech.com/ articles about CPU designs, and http://agner.org/optimize/ guides to CPU internals. Once you have a mental model of what your CPU is doing, the pieces fit together pretty well, and any new information slots in. – Peter Cordes Aug 19 '15 at 15:24
  • @BlueStrat: Also get a copy of Intel's instruction reference manual. I've updated http://stackoverflow.com/tags/x86/info recently with links to good stuff. – Peter Cordes Aug 19 '15 at 15:27
  • https://software.intel.com/sites/landingpage/IntrinsicsGuide/ - good quick reference on intrinsics. I use it all the time. – Elalfer Aug 19 '15 at 16:45
  • @PeterCordes wish I could upvote this more, thanks for the detailed cycle count! – BlueStrat Aug 19 '15 at 20:17
  • @BlueStrat: you can change which answer you marked as accepted. Elalfer's answer is good, but I think mine's better :) Glad you found this useful. :) – Peter Cordes Aug 19 '15 at 20:34