
Why is there no NEON intrinsic that performs a Signed saturating Rounding Doubling Multiply (returning the high half) on 8-bit integers, like there is for signed 16-bit integers (vqrdmulhq_s16)? More generally, there are only a few intrinsics for multiplying 8-bit integers. Is there any particular reason these instructions are missing?

Right now, the only solution I can think of is to split the int8x16 vector and perform each int8x8 multiplication separately, widening int8x8 to int16x8.
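A minimal sketch of that widening workaround, assuming the desired 8-bit result mirrors the 16-bit instruction, i.e. `sat8((2*a*b + 128) >> 8)`. The function names (`qrdmulh_s8_scalar`, `qrdmulhq_s8_widen`, `qrdmulh_s8_x16`) are hypothetical, not part of `arm_neon.h`. The NEON path relies on `(2*p + 128) >> 8` being equal to `(p + 64) >> 7`, which is what `vqrshrn_n_s16(p, 7)` computes on the widened product `p`, including the saturation needed for `-128 * -128`. The scalar fallback lets the same file build on a non-ARM host.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference (hypothetical helper): 8-bit analogue of the 16-bit
 * saturating rounding doubling multiply-high.
 * Assumes >> on a negative int is an arithmetic shift, which is
 * implementation-defined in C but holds on mainstream compilers. */
static int8_t qrdmulh_s8_scalar(int8_t a, int8_t b)
{
    int32_t r = (2 * (int32_t)a * (int32_t)b + 128) >> 8;
    if (r > INT8_MAX) {
        r = INT8_MAX; /* only reachable for a == b == -128 */
    }
    return (int8_t)r;
}

#if defined(__ARM_NEON)
/* Widening workaround (hypothetical helper): two int8x8 multiplies that
 * widen to int16x8, then a saturating rounding narrowing shift by 7. */
static int8x16_t qrdmulhq_s8_widen(int8x16_t a, int8x16_t b)
{
    int16x8_t lo = vmull_s8(vget_low_s8(a), vget_low_s8(b));
    int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
    return vcombine_s8(vqrshrn_n_s16(lo, 7), vqrshrn_n_s16(hi, 7));
}
#endif

/* Processes exactly 16 elements; uses NEON when available, otherwise
 * falls back to the scalar reference. */
static void qrdmulh_s8_x16(const int8_t *a, const int8_t *b, int8_t *out)
{
#if defined(__ARM_NEON)
    vst1q_s8(out, qrdmulhq_s8_widen(vld1q_s8(a), vld1q_s8(b)));
#else
    for (int i = 0; i < 16; ++i) {
        out[i] = qrdmulh_s8_scalar(a[i], b[i]);
    }
#endif
}
```

This still handles all 16 lanes per call, at the cost of two widening multiplies and two narrowing shifts instead of a single instruction.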

ilp
  • I think [that neon instruction](https://developer.arm.com/architectures/instruction-sets/intrinsics/vqrdmulhq_s16) is the same operation as [x86 `pmulhrsw`](https://www.felixcloutier.com/x86/pmulhrsw), which also only exists in 16-bit form. (Although x86 SIMD is notorious for only having some operations in some sizes in general, like only one 8-bit multiply which has horizontal summing of pairs.) – Peter Cordes May 06 '23 at 17:11
  • Anyway, that specific scaling and right-shift with rounding is primarily aimed at speeding up 16-bit audio, and maybe video color-space transformation as mentioned in [Expected speedup from the use of SSSE3 on an Intel machine](https://stackoverflow.com/q/13232668) / [Why on earth would I want to use PMULHRSW/VPMULHRSW?](https://stackoverflow.com/q/73942946) . Generally yes, if you want to deal with 8-bit integers, sometimes the best option is to unpack to 16-bit and re-pack later. (e.g. odd/even elements.) – Peter Cordes May 06 '23 at 17:13
  • signed q15 fixed point is the de facto standard in signal processing, hence it deserves some dedicated instructions. There are even some in `aarch32` arm instructions such as `smulw` – Jake 'Alquimista' LEE May 07 '23 at 04:24
  • Thanks for your answers! I was hoping I would be able to benefit from 16 parallel multiplications. I am trying to accelerate an algorithm. So int16x8 is probably my best option. – ilp May 07 '23 at 09:55

0 Answers