I'm optimising the following code with AVX and want to know your opinion about the best approach.
There are two blocks of data:
uint8 x[3][3];
uint8 y[3][3];
The result is a uint8 value: the sum of the products of corresponding elements, shifted right by NN bits:
res = (x[0][0]*y[0][0] + x[0][1]*y[0][1] + ... + x[2][2]*y[2][2]) >> NN
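For reference, here is a minimal scalar sketch of the computation described above, which a vectorised version could be checked against. The function names `dot9_sum` and `dot9_shifted` and the `shift` parameter (standing in for NN, whose value is not given) are hypothetical, and truncating the shifted sum to uint8 is an assumption about the intended behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of the nine products x[i][j] * y[i][j].
 * Each product is at most 255 * 255 = 65025 (fits in 16 bits);
 * the sum is at most 9 * 65025 = 585225 (needs 20 bits, so uint32_t). */
static uint32_t dot9_sum(const uint8_t x[3][3], const uint8_t y[3][3])
{
    uint32_t sum = 0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += (uint32_t)x[i][j] * (uint32_t)y[i][j];
    return sum;
}

/* Hypothetical wrapper: `shift` stands in for NN from the question.
 * Assumption: the shifted sum is simply truncated to 8 bits. */
static uint8_t dot9_shifted(const uint8_t x[3][3], const uint8_t y[3][3],
                            unsigned shift)
{
    assert(shift < 32); /* shifting a 32-bit value by >= 32 is undefined */
    return (uint8_t)(dot9_sum(x, y) >> shift);
}
```

Note that the result only fits in uint8 without truncation when NN is at least 12, since the maximum sum 585225 needs 20 bits.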
My concerns are:
First, the result of x[0][0]*y[0][0] is uint16, so before any multiplication I need to unpack uint8 into uint16, which costs extra instructions.
Second, the sum is a uint32 value, so before adding up the multiplication results I need to unpack uint16 into uint32. That is also overhead.
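To make the two widening steps concrete, here is a rough sketch of the unpack-based approach I have in mind. It uses only SSE2 intrinsics rather than AVX (nine bytes fit in one 128-bit register), the function name `dot9_widen` is hypothetical, and the zero-padded staging buffers are an assumption made to keep the loads in bounds. It falls back to a scalar loop when SSE2 is not available.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: sum of x[i][j] * y[i][j] via explicit widening. */
static uint32_t dot9_widen(const uint8_t x[3][3], const uint8_t y[3][3])
{
#if defined(__SSE2__)
    /* Copy the 9 bytes into zero-padded 16-byte buffers so a full
     * 128-bit load is safe and the padding contributes 0 to the sum. */
    uint8_t bx[16] = {0}, by[16] = {0};
    memcpy(bx, x, 9);
    memcpy(by, y, 9);

    const __m128i zero = _mm_setzero_si128();
    __m128i vx = _mm_loadu_si128((const __m128i *)bx);
    __m128i vy = _mm_loadu_si128((const __m128i *)by);

    /* Step 1: zero-extend uint8 -> uint16 (the first unpack overhead). */
    __m128i xlo = _mm_unpacklo_epi8(vx, zero);
    __m128i xhi = _mm_unpackhi_epi8(vx, zero);
    __m128i ylo = _mm_unpacklo_epi8(vy, zero);
    __m128i yhi = _mm_unpackhi_epi8(vy, zero);

    /* 16-bit products: 255 * 255 = 65025 still fits in 16 bits,
     * so the low half of each product is the full unsigned result. */
    __m128i plo = _mm_mullo_epi16(xlo, ylo);
    __m128i phi = _mm_mullo_epi16(xhi, yhi);

    /* Step 2: zero-extend uint16 -> uint32 (the second unpack overhead)
     * and accumulate in 32-bit lanes. */
    __m128i s = _mm_add_epi32(_mm_unpacklo_epi16(plo, zero),
                              _mm_unpackhi_epi16(plo, zero));
    s = _mm_add_epi32(s, _mm_unpacklo_epi16(phi, zero));
    s = _mm_add_epi32(s, _mm_unpackhi_epi16(phi, zero));

    /* Horizontal sum of the four 32-bit lanes. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, s);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    /* Scalar fallback for targets without SSE2. */
    uint32_t sum = 0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += (uint32_t)x[i][j] * (uint32_t)y[i][j];
    return sum;
#endif
}
```

The eight unpack instructions here are exactly the overhead I would like to avoid.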
Is there any simpler/faster way to do the same math without the extra unpack instructions?
Is there a way to multiply bytes and get a uint32 or uint16 result without extra data conversions?
Thanks.
PS: all elements of x and y are in the range [0...255].