Using AVX/AVX2 intrinsics, I can gather sets of 8 values, either 1-, 2- or 4-byte integers or 4-byte floats, using:
_mm256_i32gather_epi32()
_mm256_i32gather_ps()
But I now have a case where I am loading data that was generated on an NVIDIA GPU and stored as FP16 values. How can I do vectorized loads of these values?
So far, I found the _mm256_cvtph_ps() intrinsic.
However, input for that intrinsic is a __m128i value, not a __m256i value.
Looking at the Intel Intrinsics Guide, I see no gather operations that store 8 values into a __m128i register.
How can I gather FP16 values into the 8 lanes of a __m256 register? Is it possible to vector load them as 2-byte shorts into __m256i and then somehow reduce that to a __m128i value to be passed into the conversion intrinsic? If so, I haven't found intrinsics to do that.
UPDATE
I tried the cast as suggested by @peter-cordes, but I am getting bogus results from it. Also, I don't understand how it could work.
My 2-byte int values are stored in __m256i as:
0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX
so how can I simply cast to __m128i where it needs to be tightly packed as
XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
Will the cast do that?
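For reference, here is a scalar model of what I understand _mm256_castsi256_si128() to do with this layout. It is only a sketch: low128_as_u16 is a hypothetical helper, not an intrinsic, and it assumes a little-endian machine (as x86 is). The cast keeps the low 128 bits unchanged and does no repacking, so viewing those 16 bytes as eight 16-bit elements gives X0, 0, X1, 0, X2, 0, X3, 0, and X4..X7 are dropped.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model, NOT the intrinsic itself (hypothetical helper).
   lanes32: the eight 32-bit lanes of the masked __m256i (0000XXXX each).
   out16:   the eight 16-bit elements seen when the low 128 bits are
            reinterpreted as __m128i, which is what cvtph_ps consumes. */
void low128_as_u16(const uint32_t lanes32[8], uint16_t out16[8])
{
    /* The cast keeps only the low 16 bytes, i.e. 32-bit lanes 0..3.
       Nothing is shifted or compacted. */
    memcpy(out16, lanes32, 16);
}
```

With lanes holding 0x3C00, 0x4000, 0x4200, 0x4400, ... the model yields 0x3C00, 0, 0x4000, 0, 0x4200, 0, 0x4400, 0, which would match the symptom of every second converted value being wrong.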
My current code:
__fp16* fielddensity = ...
__m256i indices = ...
__m256i msk = _mm256_set1_epi32(0xffff);
__m256i d = _mm256_and_si256(_mm256_i32gather_epi32(fielddensity,indices,2), msk);
__m256 v = _mm256_cvtph_ps(_mm256_castsi256_si128(d));
But the result doesn't seem to be 8 properly formed values. I think every second value is bogus.
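One possible way to get the tight packing is to narrow the eight masked 32-bit lanes to 16 bits before converting. The sketch below is untested against my real data and makes several assumptions: gather_fp16_to_float is a hypothetical helper name, the FP16 data is held as raw uint16_t bits rather than __fp16, the compiler is GCC or Clang (for the target attribute), and the buffer has at least 2 bytes of padding after the last element because each gather lane still reads 4 bytes.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: gathers eight FP16 values (raw 16-bit patterns)
   at the given element indices and writes them to out as floats.
   Assumes base has >= 2 bytes of readable padding after its last element. */
__attribute__((target("avx2,f16c")))
void gather_fp16_to_float(const uint16_t *base, const int32_t idx[8], float out[8])
{
    __m256i indices = _mm256_loadu_si256((const __m256i *)idx);

    /* Scale 2: indices count 2-byte elements, but each lane loads 4 bytes. */
    __m256i g = _mm256_i32gather_epi32((const int *)base, indices, 2);

    /* Keep only the low 16 bits of each 32-bit lane: 0000XXXX. */
    __m256i d = _mm256_and_si256(g, _mm256_set1_epi32(0xffff));

    /* Split into two 128-bit halves and narrow 32 -> 16 bits.
       Values are already <= 0xffff, so unsigned saturation changes nothing.
       Result order: X0 X1 X2 X3 X4 X5 X6 X7. */
    __m128i lo = _mm256_castsi256_si128(d);
    __m128i hi = _mm256_extracti128_si256(d, 1);
    __m128i packed = _mm_packus_epi32(lo, hi);

    /* Convert eight tightly packed FP16 values to eight floats. */
    _mm256_storeu_ps(out, _mm256_cvtph_ps(packed));
}
```

The extract-then-pack step is used instead of _mm256_packus_epi32() because the 256-bit pack works within each 128-bit lane and would need an extra permute to restore element order.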