Using the blend instructions in intel intrinsics (AVX)

Question

I have a question regarding the AVX _mm256_blend_pd function.

I want to optimize my code where I use heavily the _mm256_blendv_pd function. This unfortunately has a pretty high latency and low throughput. This function takes as input three __m256d variables where the last one represents the mask that is used to select from the first 2 variables.

I found another function (_mm256_blend_pd) which takes a bit mask instead of a __m256d variable as mask. When the mask is static I could simply pass something like 0b0111 to take the first element from the first variable and the last 3 elements of the second variable. However in my case the mask is computed using _mm_cmp_pd function which returns a __m256d variable. I found out that I can use _mm256_movemask_pd to return an int from the mask, however when passing this into the function _mm256_blend_pd I get an error error: the last argument must be a 4-bit immediate.

Is there a way to pass this integer using its first 4 bits? Or is there another function similar to movemask that would allow me to use _mm256_blend_pd? Or is there another approach I can use to avoid having a cmp, movemask and blend that would be more efficient for this use case?

Peter Cordes · Accepted Answer · 2020-05-21T02:24:59.353

_mm256_blend_pd is the intrinsic for vblendpd which takes its control operand as an immediate constant, embedded into the machine code of the instruction. (That's what "immediate" means in assembly / machine code terminology.)

In C++ terms, the control arg must be constexpr so the compiler can embed it into the instruction at compile time. You can't use it for runtime-variable blends.

It's unfortunate that variable-blend instructions like vblendvpd are slower, but they're "only" 2 uops on Skylake, with 1 or 2 cycle latency (depending on which input you're measuring the critical path through). (uops.info). And on Skylake those uops can run on any of the 3 vector ALU ports. (Worse on Haswell/Broadwell, though, limited to port 5 only, competing for it with shuffles). Zen can even run them as a single uop.

There's nothing better for the general case until AVX512 makes masking a first-class operation you can do as part of other instructions, and gives us single-uop blend instructions like vblendmpd ymm0{k1}, ymm1, ymm2 (blend according to a mask register).

In some special cases you can usefully _mm256_and_pd to conditionally zero instead of blending, e.g. to zero an input before an add instead of blending after.

TL:DR: _mm256_blend_pd lets you use a faster instruction for the special case where the control is a compile-time constant.

Oh I see, now it makes more sense. I bit unfortunate but thanks a lot for the clarification. I have to stick with blendv then. — DevX10, May 21 '20 at 02:36
@anatolyg: `vpermilps` does take a vector control operand, but it's a shuffle with only 1 data input. I'm not seeing any obvious way to blend with it. If you have something in mind, maybe ask the OP in a comment under the question. — Peter Cordes, May 21 '20 at 02:43

Using the blend instructions in intel intrinsics (AVX)

1 Answers1