I have a question regarding the AVX _mm256_blend_pd function.
I want to optimize code that makes heavy use of the _mm256_blendv_pd function. Unfortunately, it has fairly high latency and low throughput. It takes three __m256d arguments, the last of which is the mask used to select between the first two.
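For context, here is a minimal sketch of the pattern I mean. The helper name, the pointer-based interface, and the less-than comparison are assumptions made for illustration, not my actual code. The target attribute is GCC/Clang specific; compiling with -mavx would work as well.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = (x[i] < y[i]) ? b[i] : a[i] for 4 doubles.
   The target attribute (GCC/Clang) enables AVX for this function only. */
__attribute__((target("avx")))
void select_where_less(const double *a, const double *b,
                       const double *x, const double *y, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);

    /* Each lane of mask is all-ones where x < y and all-zeros otherwise. */
    __m256d mask = _mm256_cmp_pd(_mm256_loadu_pd(x), _mm256_loadu_pd(y),
                                 _CMP_LT_OQ);

    /* blendv takes a lane from vb where the mask's sign bit is set,
       and from va otherwise. */
    _mm256_storeu_pd(out, _mm256_blendv_pd(va, vb, mask));
}
```

The mask here is only known at run time, which is why the variable blend is used.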
I found another function (_mm256_blend_pd) which takes a bit mask instead of a __m256d variable as the mask. When the mask is static, I could simply pass something like 0b0111 to take the highest element from the first variable and the lowest three elements from the second variable. However, in my case the mask is computed using the _mm256_cmp_pd function, which returns a __m256d variable. I found out that I can use _mm256_movemask_pd to get an int from the mask, but when I pass this int to _mm256_blend_pd I get the error "error: the last argument must be a 4-bit immediate".
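To make the mismatch concrete, here is a small sketch under the same assumptions as before (hypothetical helper names, a less-than comparison, GCC/Clang target attribute). The first helper produces the mask bits at run time; the second shows the only form of _mm256_blend_pd that compiles for me, with a constant written directly in the call.

```c
#include <immintrin.h>

/* Hypothetical helper: returns the 4-bit lane mask for x[i] < y[i];
   bit i is set when lane i compares true. */
__attribute__((target("avx")))
int less_than_bits(const double *x, const double *y)
{
    __m256d mask = _mm256_cmp_pd(_mm256_loadu_pd(x), _mm256_loadu_pd(y),
                                 _CMP_LT_OQ);
    return _mm256_movemask_pd(mask);  /* run-time value, not an immediate */
}

/* Hypothetical helper: blend with the fixed immediate 0x7 (binary 0111):
   lanes 0..2 come from b, lane 3 comes from a. */
__attribute__((target("avx")))
void blend_fixed_0111(const double *a, const double *b, double *out)
{
    __m256d r = _mm256_blend_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b), 0x7);
    _mm256_storeu_pd(out, r);
}
```

Replacing the literal 0x7 with the int returned by less_than_bits is what triggers the error, since the blend mask is encoded in the instruction itself.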
Is there a way to pass this integer using only its lowest 4 bits? Or is there another function similar to movemask that would let me use _mm256_blend_pd? Or is there a more efficient approach for this use case that avoids the cmp, movemask, and blend sequence?