I'm optimising the following code with AVX and want to know your opinion about the best approach.

There are two blocks of data, uint8 x[3][3]; uint8 y[3][3]; the result is a uint8 value which is the sum of the products of corresponding elements, like

res = (x[0][0]*y[0][0] + x[0][1]*y[0][1] + ... + x[2][2]*y[2][2]) >> NN

My concerns are:

  • The result of x[0][0]*y[0][0] is uint16, so before any multiplication I need to unpack uint8 into uint16, which costs extra instructions.

  • The sum is a uint32 value, so before adding up the multiplication results I need to unpack uint16 into uint32. That is also overhead.

Is there any simpler/faster way to do the same math without the extra unpack instructions?

Is there a way to multiply bytes and get a uint32 or uint16 result without extra data conversions?

Thanks.

PS: all elements of x[3][3] and y[3][3] are in the range [0...255]

user3124812
  • The first step could be done by `(V)PUNPCKLBW`, the second by `(V)PMADDWD` + horizontal add. If you didn't have the requirement of `uint16_t`-intermediaries, you might be able to instead use `(V)PMADDUBSW` for the first step, but the second step would be more complicated. – EOF May 10 '16 at 14:48
  • PMADDUBSW is multiplication of signed & unsigned bytes, not unsigned & unsigned. Unpacking (PUNPCKLBW) is exactly what I want to avoid. – user3124812 May 11 '16 at 03:27
  • If you know that one of your sources of bytes is signed positive, or unsigned and less than 0x7F, then you can use `pmaddubsw`. Otherwise not. I forget if AVX-512 has something. Welcome to the joys of SSE/AVX's highly non-orthogonal choice of instructions. It often happens that the perfect operation is available, but not for the element size or signedness you need. You're probably going to need a `punpck` or `pmovzx`. – Peter Cordes May 11 '16 at 05:03
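
For illustration, the route suggested in the comments (zero-extend with `punpcklbw`/`punpckhbw`, then `pmaddwd` plus a horizontal add) might look roughly like the sketch below. It is not a tuned implementation: the function name `dot3x3_shift`, the `nn` parameter, and the 16-byte padded copies are assumptions made for the example, and it uses plain SSE2 rather than AVX so that it builds on any x86-64 compiler (a scalar fallback is used elsewhere). Note that `pmaddwd` multiplies 16-bit lanes and adds adjacent pairs directly into 32-bit lanes, so no separate uint16-to-uint32 unpack is needed.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: sum of x[i][j] * y[i][j] over a 3x3 block,
 * shifted right by nn (nn must be < 32). Returns the 32-bit value;
 * the caller narrows it if a uint8 result is wanted. */
static uint32_t dot3x3_shift(const uint8_t x[3][3], const uint8_t y[3][3],
                             unsigned nn)
{
#if defined(__SSE2__)
    /* Assumption: copy the 9 bytes into zero-padded 16-byte buffers so a
     * full vector load never reads past the arrays. */
    uint8_t xb[16] = {0}, yb[16] = {0};
    memcpy(xb, x, 9);
    memcpy(yb, y, 9);

    const __m128i zero = _mm_setzero_si128();
    __m128i vx = _mm_loadu_si128((const __m128i *)xb);
    __m128i vy = _mm_loadu_si128((const __m128i *)yb);

    /* Zero-extend uint8 -> 16-bit lanes (punpcklbw / punpckhbw). */
    __m128i xlo = _mm_unpacklo_epi8(vx, zero);
    __m128i xhi = _mm_unpackhi_epi8(vx, zero);
    __m128i ylo = _mm_unpacklo_epi8(vy, zero);
    __m128i yhi = _mm_unpackhi_epi8(vy, zero);

    /* pmaddwd: 16x16 multiply, adjacent pairs summed into 32-bit lanes.
     * Inputs are 0..255, so the signed interpretation is harmless. */
    __m128i sum = _mm_add_epi32(_mm_madd_epi16(xlo, ylo),
                                _mm_madd_epi16(xhi, yhi));

    /* Horizontal add of the four 32-bit lanes. */
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));

    return (uint32_t)_mm_cvtsi128_si32(sum) >> nn;
#else
    /* Scalar fallback for targets without SSE2. */
    uint32_t sum = 0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += (uint32_t)x[i][j] * y[i][j];
    return sum >> nn;
#endif
}
```

With only nine elements per block, the padding copies and the horizontal add probably dominate the cost; if many 3x3 blocks are processed, it may pay off to lay the data out so that several blocks are handled per vector.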
