
Using AVX/AVX2 intrinsics, I can gather sets of 8 values, either 1-, 2- or 4-byte integers, or 4-byte floats, using:

_mm256_i32gather_epi32()

_mm256_i32gather_ps()

But currently I have a case where I am loading data that was generated on an NVIDIA GPU and stored as FP16 values. How can I do vectorized loads of these values?

So far, I found the _mm256_cvtph_ps() intrinsic.

However, input for that intrinsic is a __m128i value, not a __m256i value.
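(As far as I understand, for contiguous data the conversion itself would be straightforward, something like this untested sketch, where load8_fp16 is just an illustrative name and F16C support is assumed. But my values have to be gathered through an index vector.)

#include <immintrin.h>
#include <stdint.h>

// load 8 contiguous FP16 values (raw bit patterns) and widen them to float
static inline __m256 load8_fp16(const uint16_t* src){
  __m128i halves = _mm_loadu_si128((const __m128i*)src); // 8 x 16-bit half floats
  return _mm256_cvtph_ps(halves);                        // 8 x float
}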

Looking at the Intel Intrinsics Guide, I see no gather operations that store 8 values into a __m128i register.

How can I gather FP16 values into the 8 lanes of a __m256 register? Is it possible to vector load them as 2-byte shorts into __m256i and then somehow reduce that to a __m128i value to be passed into the conversion intrinsic? If so, I haven't found intrinsics to do that.

UPDATE

I tried the cast as suggested by @peter-cordes, but I am getting bogus results from it. Also, I don't understand how that could work.

My 2-byte int values are stored in __m256i as:

0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX

so how can I simply cast to __m128i where it needs to be tightly packed as

XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX

Will the cast do that?

My current code:

__fp16* fielddensity = ...
__m256i indices = ...
__m256i msk = _mm256_set1_epi32(0xffff);
// gather 8 dwords at scale 2, then keep only the low 16 bits of each
__m256i d = _mm256_and_si256(_mm256_i32gather_epi32(fielddensity,indices,2), msk);
// take the low 128 bits of d and convert them as 8 packed halves
__m256 v = _mm256_cvtph_ps(_mm256_castsi256_si128(d));

But the result doesn't seem to be 8 properly formed values; I think every 2nd one is bogus.

  • There is no hardware support in x86 CPUs for gather (or scatter) with elements narrower than 32-bit. If you actually need gather for non-contiguous values, yes you probably want to gather 8x 32-bit elements and shuffle them down to 8x 16-bit elements in the bottom of a `__m256i`, and use that as a `__m128i` (with a cast). Careful that gathering the top element of your array can't cross into an unmapped page. And yes, the only x86 support for half-precision floats is converting them to/from single precision (until some future AVX512) – Peter Cordes Jun 16 '20 at 20:03
  • If you have multiple gathers to do, you might be able to amortize the packing by shuffling or blending 2 vectors together, then reordering stuff after converting up to float? – Peter Cordes Jun 16 '20 at 20:05
  • For the 16-bit gather part: [Gather AVX2&512 intrinsic for 16-bit integers?](https://stackoverflow.com/q/59339611) – Peter Cordes Jun 16 '20 at 20:07
  • And BTW, it shouldn't be surprising that a SIMD instruction or intrinsic like `_mm256_cvtph_ps` that widens every element has an input that's half the width of its output. – Peter Cordes Jun 16 '20 at 20:09
  • @PeterCordes Thanks, I am confused by your 1st statement. I am able to gather 8 bytes using scale value '1' in _mm256_i32gather_epi32() as long as I mask out all high bits afterwards. I tested that. And I am pretty sure with scale 2, I can do the same for 16b ints as well. About the cast: I can just do (__m128i) on a __m256i value? I'll try. – Bram Jun 16 '20 at 20:16
  • To be portable, you should use `_mm256_castsi256_si128` to cast from `__m256i` to `__m128i` (C-style casting may work on most compilers, though). – chtz Jun 16 '20 at 20:23
  • @Bram: As I understand that instruction, you're actually gathering 8 misaligned dwords. Of course you can then ignore, or mask off, everything but the low bytes, or as Peter suggests, you can reshuffle them instead. – Nate Eldredge Jun 16 '20 at 20:23

1 Answer


There is indeed no gather instruction for 16-bit values, so you need to gather 32-bit values and ignore one half of them (and make sure that you don't accidentally read from invalid memory). Also, _mm256_cvtph_ps() needs all input values in the lower 128-bit lane, and unfortunately there is no lane-crossing 16-bit shuffle (until AVX512).
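One way to still feed _mm256_cvtph_ps() is to pack the gathered dwords down to 16 bits in-lane and fix the order with a 64-bit permute (working around the missing lane-crossing 16-bit shuffle). A rough, untested sketch; the function name is just illustrative, and it assumes AVX2+F16C and that reading 2 bytes past each gathered element is safe, since every 32-bit gather load also pulls in the neighbouring FP16 value:

#include <immintrin.h>

__m256 gather_fp16_via_cvtph(__fp16 const* fielddensity, __m256i indices){
  // gather 8 dwords; each wanted FP16 value ends up in the low 16 bits of its dword
  __m256i d  = _mm256_i32gather_epi32((int const*)fielddensity, indices, 2);
  d          = _mm256_and_si256(d, _mm256_set1_epi32(0xffff));

  // pack 32->16 bit within each 128-bit lane, then reorder the 64-bit chunks so
  // all 8 halves end up contiguous in the low 128 bits
  __m256i pk = _mm256_packus_epi32(d, d);
  pk         = _mm256_permute4x64_epi64(pk, _MM_SHUFFLE(3, 1, 2, 0));
  return _mm256_cvtph_ps(_mm256_castsi256_si128(pk));
}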

However, assuming you have only finite input values, you could also do some bit-twiddling (avoiding _mm256_cvtph_ps() entirely). If you load a half-precision value into the upper half of a 32-bit register, you can do the following operations:

SEEEEEMM MMMMMMMM XXXXXXXX XXXXXXXX  // input Sign, Exponent, Mantissa, X=garbage

Shift arithmetically to the right by 3 (this keeps the sign bit where it needs to be):

SSSSEEEE EMMMMMMM MMMXXXXX XXXXXXXX 

Mask away the excess sign bits and the garbage at the bottom (with 0b1000'11111'1111111111'0000000000000 = 0x8fffe000):

S000EEEE EMMMMMMM MMM00000 00000000

This will be a valid single-precision float, but the exponent will be off by 112 = 127-15 (the difference between the exponent biases), i.e. you need to multiply these values by 2**112 (this can be combined with any subsequent operation you intend to do anyway). For example, fp16 1.0 is 0x3C00 with a biased exponent of 15; after the shift and mask the float32 exponent field also holds 15, which encodes 2**(15-127) = 2**-112, and multiplying by 2**112 gives back 1.0. Note that this will also convert sub-normal float16 values to the corresponding sub-normal float32 values (which are likewise off by a factor of 2**112).

Untested intrinsic version:

#include <immintrin.h>
#include <stdint.h>

__m256 gather_fp16(__fp16 const* fielddensity, __m256i indices){
  // subtract 2 bytes from base address to load data into high parts:
  int32_t const* base = (int32_t const*) ( fielddensity - 1);

  // Gather 32bit values.
  // Be aware that this reads two bytes before each desired value,
  // i.e., make sure that reading fielddensity[-1] is ok!
  __m256i d = _mm256_i32gather_epi32(base, indices, 2);

  // shift exponent bits to the right place and mask away excessive bits:
  d = _mm256_and_si256(_mm256_srai_epi32(d, 3), _mm256_set1_epi32(0x8fffe000));

  // scale values to compensate bias difference (could be combined with subsequent operations ...)
  __m256 two112 = _mm256_castsi256_ps(_mm256_set1_epi32(0x77800000)); // 2**112
  __m256 f = _mm256_mul_ps(_mm256_castsi256_ps(d), two112);

  return f;
}
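A hypothetical usage sketch (the data and indices are made up; note that gather_fp16() reads the two bytes just before each gathered element, so an index of 0 would touch fielddensity[-1]):

__fp16 field[64];
for (int i = 0; i < 64; ++i) field[i] = (__fp16)(i * 0.25f);

__m256i idx = _mm256_setr_epi32(1, 5, 9, 13, 17, 21, 25, 29); // all >= 1, so no out-of-bounds read
__m256  v   = gather_fp16(field, idx);  // {0.25, 1.25, 2.25, ..., 7.25}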
  • As well as requiring finite, is subnormal special at all? I think maybe no. But it would be if you tried to re-scale with integer add to the exponent field instead of FP multiply. – Peter Cordes Jun 16 '20 at 22:02
  • Sub-normals should work, since the bit-shift will convert them to the corresponding float32-subnormal (which is also off by a factor of `2**122` from float16-subnormals). But I did not actually test this. If there were no sub-normal inputs, the final multiplication could indeed also be done by an integer addition. The float-multiplication has the additional advantage that it could be combined (possibly to a FMA) with some subsequent float operations. – chtz Jun 16 '20 at 22:07
  • Thanks for finding the 122-typo (I also made that in the source comments -- but the constant should be good (maybe writing `(127+127-15)<<23` would be better) – chtz Jun 16 '20 at 22:48
  • Perhaps also worth adding a comment in the code block about loading 2 bytes before every element. And in the text being more explicit about the consequence: this can break for an array that's page-aligned if it's not preceded by a mapped page, if you gather element 0. Perhaps easy to miss for novices who haven't really understood what this is doing or have thought through the wider-element consequences before. Nice idea BTW, much better than what I was thinking with vpblendw 2 vectors + vpshufb + vextracti128 to feed 2x vcvtph2ps, or some variation on that. – Peter Cordes Jun 16 '20 at 22:56
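For reference, a sketch of the integer-addition variant mentioned in the comments above (same gather and shift/mask as the answer; the extra assumption is that every input is a normal, nonzero, finite FP16 value, since zeros and subnormals come out wrong without the float multiply):

__m256 gather_fp16_intadd(__fp16 const* fielddensity, __m256i indices){
  int32_t const* base = (int32_t const*) (fielddensity - 1); // same 2-byte offset trick as above
  __m256i d = _mm256_i32gather_epi32(base, indices, 2);
  d = _mm256_and_si256(_mm256_srai_epi32(d, 3), _mm256_set1_epi32(0x8fffe000));

  // add 112 = 127-15 directly to the exponent field instead of multiplying by 2**112
  d = _mm256_add_epi32(d, _mm256_set1_epi32(112 << 23));
  return _mm256_castsi256_ps(d);
}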