I am writing vector code with RISC-V intrinsics for the V (vector) extension, but this question probably applies to vectorisation generally.
I need to multiply and accumulate lots of uint8 values. To do this I want to fill the vector registers with uint8s, multiply and accumulate (MAC) in a loop, and be done. However, to avoid overflow, the result of the accumulation would normally have to be stored in a larger type, e.g. uint32. How does this extend to vectors?
I imagine I have to split the vector registers into 32-bit lanes and accumulate into them, but writing vectorised code is new to me. Is there a way I can split the vector registers into 8-bit lanes for better parallelism, and still avoid the overflow?
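To make the goal concrete, here is a scalar C sketch of the computation I want to vectorise. The function names `mac_u8_scalar` and `mac_u8_narrow` are made up for illustration. Each 8-bit product fits in 16 bits (255 * 255 = 65025), and a 32-bit accumulator has room for about 66,000 worst-case products, whereas an 8-bit accumulator wraps almost immediately.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: 8-bit inputs, 16-bit products, 32-bit accumulator. */
uint32_t mac_u8_scalar(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* 255 * 255 = 65025, so the product always fits in 16 bits. */
        uint16_t prod = (uint16_t)((uint16_t)a[i] * (uint16_t)b[i]);
        acc += prod;
    }
    return acc;
}

/* Same loop with an 8-bit accumulator, to show the wrap-around
   I am trying to avoid (the result is reduced modulo 256). */
uint8_t mac_u8_narrow(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint8_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = (uint8_t)(acc + a[i] * b[i]);
    }
    return acc;
}
```

The vector version should give the same answer as `mac_u8_scalar` while still loading the inputs as 8-bit elements.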
A problem arises because I fill a vector register by providing a pointer to an array of uint8:
vuint8m1_t vec_u8s = __riscv_vle8_v_u8m1(ptr_a, vl);
but if I were to replace this with...
vuint32m1_t vec_u8s_in_32bit_lanes = __riscv_vle32_v_u32m1(ptr_a, vl);
I think it would read from my array as 32-bit values, packing 4 (uint8) elements into one (uint32) lane. Is my understanding correct? How should I avoid this?
Is it ok because ptr_a is defined as uint8_t * ptr_a ... ?
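The difference I am worried about can be modelled in plain C. This is only a sketch of my current understanding, with the hypothetical helpers `load_as_u32` and `load_u8_widen` standing in for the two kinds of load. As far as I can tell, the 32-bit load intrinsic takes a `const uint32_t *`, so passing `ptr_a` would need a cast, and the pointer type alone would not change how memory is read.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models a 32-bit element load from a byte array:
   each lane consumes 4 consecutive bytes of the source. */
void load_as_u32(const uint8_t *src, uint32_t *lanes, size_t n_lanes)
{
    for (size_t i = 0; i < n_lanes; i++) {
        memcpy(&lanes[i], src + 4 * i, sizeof(uint32_t));
    }
}

/* Models an 8-bit element load followed by zero-extension:
   each lane holds exactly one source byte. */
void load_u8_widen(const uint8_t *src, uint32_t *lanes, size_t n_lanes)
{
    for (size_t i = 0; i < n_lanes; i++) {
        lanes[i] = src[i];
    }
}
```

With the bytes 1, 2, 3, 4, ... as input, `load_u8_widen` gives lanes 1, 2, ..., which is what I want, while `load_as_u32` puts bytes 1 to 4 into the first lane.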
Edit:
Perhaps what I'm looking for is
vint32m1_t __riscv_vlse32_v_i32m1_m (vbool32_t mask, const int32_t *base, ptrdiff_t bstride, size_t vl);
where I can set the mask to 0xFF and the stride to 1 to read data at 1-byte increments?
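Thinking about it further, here is a scalar model of what I would expect a 32-bit strided load to do. It assumes the stride is given in bytes and that the mask is a per-element on/off predicate rather than a bitmask applied to the loaded data; both are my reading of the spec, and `strided_load_u32` is a made-up helper, not an intrinsic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models an unmasked 32-bit strided load: element i is a full
   4-byte read starting at byte offset i * bstride from base. */
void strided_load_u32(const uint8_t *base, ptrdiff_t bstride,
                      uint32_t *lanes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        memcpy(&lanes[i], base + (ptrdiff_t)i * bstride, sizeof(uint32_t));
    }
}
```

If this model is right, a stride of 1 gives overlapping 4-byte windows (bytes 0 to 3, then bytes 1 to 4, and so on) rather than one zero-extended byte per lane, so this may not be what I want after all.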