
I wonder if there is a fast way of multiplying an int8 array by a scalar, i.e.

int8_t x;              /* scalar multiplier */
int8_t y[n], z[n];
for (size_t i = 0; i < n; ++i)
    z[i] = x * y[i];   /* 8-bit product, truncated */

I see that the Intel intrinsics guide lists several SIMD instructions, such as _mm_mulhi_epi16 and _mm_mullo_epi16, that do something like this for int16. Is there something similar for int8 that I'm missing?

MWB
  • You're not missing it, it really doesn't exist – harold Sep 07 '21 at 05:30
  • For an arbitrary multiplier, you can unpack to 16-bit elements. For certain constants, you might be able to break it down into a shift and add or subtract. (8-bit shifts by a constant can be emulated with `_mm_slli_epi32` and `_mm_and_si128` with an appropriate mask; see the sketch after these comments.) – Peter Cordes Sep 07 '21 at 05:54
  • If you wanted to horizontal-sum z[] when you're done, you can use `pmaddubsw` to do 8-bit multiply -> horizontally add pairs into 16-bit accumulators. (But it's signed x unsigned so it's tricky to use unless you know one of your inputs is signed-positive.) – Peter Cordes Sep 07 '21 at 05:55
  • Well for what it's worth, compiling it with highest optimization on makes gcc and clang disassembly look like some manner of Klingon language. https://godbolt.org/z/saGWh8EnW. I would think twice before challenging that unholy mess with some manual optimization until I've done some serious benchmarking. – Lundin Sep 07 '21 at 07:02
  • @Lundin: It's just unpacking to 16-bit elements (with zero extension, since it's going to truncate again) for 2x `pmullw` => `pand` / `packuswb` (pack back to bytes, with truncation, so the unsigned saturation doesn't do anything). At least that's what I assume; that's a strategy that makes sense and is compatible with the instructions present. Aki's answer gets the same work done but with more efficient unpacking and re-packing. (clang's output in your link is extra complicated because it's unrolling by 2 vectors. `-fno-unroll-loops` is handy for identifying the auto-vec strategy.) – Peter Cordes Sep 07 '21 at 09:13
  • But yes, clang is probably not *much* worse than the best manual strategy, and that may be good enough / not worth the effort of improving, especially if it still does a good job with AVX2 (where lane-crossing unpack would be worse if not using Aki's slicing strategy), or with ARM NEON. i.e. tradeoff of dev time / maintainability vs. speedup relative to good auto-vectorization. – Peter Cordes Sep 07 '21 at 09:15
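
For reference, the unpack-to-16-bit strategy described in these comments might look like the following. This is a minimal sketch, not code from any commenter; the function name and the masking-before-repack detail are my own.

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Multiply 16 int8 elements by a scalar by widening to 16-bit lanes. */
__m128i mul_epi8_unpack(__m128i y, int8_t x)
{
    __m128i vx   = _mm_set1_epi16((uint8_t)x);
    __m128i zero = _mm_setzero_si128();
    /* Zero-extend bytes to 16-bit lanes; since the result is truncated
       back to 8 bits, sign extension would work equally well. */
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(y, zero), vx);
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(y, zero), vx);
    /* Clear the high byte of each 16-bit product so packuswb's unsigned
       saturation becomes a plain truncating re-pack. */
    __m128i m = _mm_set1_epi16(0x00ff);
    return _mm_packus_epi16(_mm_and_si128(lo, m), _mm_and_si128(hi, m));
}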

1 Answer


Breaking each 16-bit lane of the input into its low and high bytes, one can do:

__m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);         /* mask: high byte of each 16-bit lane */
__m128i lo = _mm_mullo_epi16(y, x);                           /* low bytes of the products, in place */
__m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x); /* high bytes of the products, in place */
__m128i z  = _mm_blendv_epi8(lo, hi, kff00ff00);              /* low bytes from lo, high bytes from hi (SSE4.1) */

AFAIK, the high bits YY of YYyy|YYyy|YYyy|YYyy multiplied by 00xx|00xx|00xx|00xx do not interfere with the low 8 bits ll of each product, and likewise the product YY00|YY00 * 00xx|00xx produces the correct 8-bit product HH00 in the high byte. The two results are already at the correct alignment and only need to be blended.

Here __m128i x = _mm_set1_epi16(scalar_x); broadcasts the scalar multiplier, and __m128i y = _mm_loadu_si128(...); loads 16 bytes of the array.
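
Put together as a complete loop, this might look like the following. This is a minimal sketch; the function name and the assumption that n is a multiple of 16 are mine, not part of the answer.

#include <smmintrin.h>  /* SSE4.1 for _mm_blendv_epi8 */
#include <stddef.h>
#include <stdint.h>

/* z[i] = x * y[i] for int8 elements, 16 at a time; assumes n % 16 == 0. */
void mul_int8_scalar(int8_t *z, const int8_t *y, int8_t x, size_t n)
{
    const __m128i kff00ff00 = _mm_set1_epi32((int)0xff00ff00);
    const __m128i vx = _mm_set1_epi16((uint8_t)x);   /* 00xx in each 16-bit lane */
    for (size_t i = 0; i < n; i += 16) {
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i lo = _mm_mullo_epi16(vy, vx);
        __m128i hi = _mm_mullo_epi16(_mm_and_si128(vy, kff00ff00), vx);
        _mm_storeu_si128((__m128i *)(z + i), _mm_blendv_epi8(lo, hi, kff00ff00));
    }
}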

An alternative is to use pshufb (_mm_shuffle_epi8), computing LutLo[y & 15] + LutHi[y >> 4] from two 16-entry lookup tables, where unfortunately the byte shift must also be emulated by _mm_and_si128(_mm_srli_epi16(y, 4), _mm_set1_epi8(15)).
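
A minimal sketch of that LUT idea follows; the runtime table construction and all names are mine, not from the answer. It relies on 8-bit wraparound: x*y mod 256 = x*(y & 15) + x*16*(y >> 4) mod 256.

#include <tmmintrin.h>  /* SSSE3 for _mm_shuffle_epi8 */
#include <stdint.h>

__m128i mul_epi8_pshufb(__m128i y, int8_t x)
{
    /* LutLo[i] = x*i mod 256, LutHi[i] = x*16*i mod 256.
       In a real loop these two table vectors would be built once, outside. */
    uint8_t lut_lo[16], lut_hi[16];
    for (int i = 0; i < 16; ++i) {
        lut_lo[i] = (uint8_t)(x * i);
        lut_hi[i] = (uint8_t)(x * i * 16);
    }
    __m128i vlo = _mm_loadu_si128((const __m128i *)lut_lo);
    __m128i vhi = _mm_loadu_si128((const __m128i *)lut_hi);
    __m128i m15 = _mm_set1_epi8(15);
    __m128i nlo = _mm_and_si128(y, m15);                    /* low nibbles  */
    __m128i nhi = _mm_and_si128(_mm_srli_epi16(y, 4), m15); /* high nibbles */
    return _mm_add_epi8(_mm_shuffle_epi8(vlo, nlo),         /* wraps mod 256 */
                        _mm_shuffle_epi8(vhi, nhi));
}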

Aki Suihkonen
  • Oh nice, good idea to split high/low within 16-bit elements, instead of unpack low/high, if you want to truncate instead of using `_mm_packus_epi16` (`packuswb`). The `x` vector here is `_mm_set1_epi16(scalar_x)`, and `y` is a load from the array. (Added that in an edit for future readers while fixing a typo) – Peter Cordes Sep 07 '21 at 06:08
  • Does this really work? I'm getting some odd results – harold Sep 07 '21 at 06:14
  • The multiplication version would need to be unrolled by a factor of 3 I think, since the multiplication has a latency of 5. For the `shufb` approach unrolling by a factor of 2 should be sufficient for optimal throughput. – Aki Suihkonen Sep 07 '21 at 06:15
  • @harold: thanks, both of the multiplies need to be mullo, I was overengineering it first. I'll double check. – Aki Suihkonen Sep 07 '21 at 06:23
  • @AkiSuihkonen: the querent's use-case is for independent work. Out-of-order exec can hide `pmullw` latency across iterations since it's not part of a loop-carried dep chain, so unrolling isn't critical. Overall throughput, not latency, is the key factor for this. – Peter Cordes Sep 07 '21 at 06:31