
I am writing vector code with RISC-V intrinsics for extension V vectors, but this question probably applies to vectorisation generally.

I need to multiply and accumulate lots of uint8 values. To do this I want to fill the vector registers with uint8s, multiply and accumulate (MAC) in a loop, and be done. However, to avoid overflow, the result of the accumulation would normally have to be stored in a larger type, e.g. uint32. How does this extend to vectors?

I imagine I have to split the vector registers into 32-bit lanes and accumulate into them, but writing vectorised code is new to me. Is there a way I can split the vector registers into 8-bit lanes for better parallelism, and still avoid the overflow?

A problem arises because I fill a vector register by providing a pointer to an array of uint8:

vuint8m1_t vec_u8s = __riscv_vle8_v_u8m1(ptr_a, vl);

but if I were to replace this with...

vuint32m1_t vec_u8s_in_32bit_lanes = __riscv_vle32_v_u32m1(ptr_a, vl);

it may read from my array as 32-bit values, packing 4 (uint8) elements into one (uint32) lane. Is my understanding correct? How should I avoid this?

Is it OK because ptr_a is defined as `uint8_t *ptr_a` ... ?

Edit:

Perhaps what I'm looking for is

vint32m1_t __riscv_vlse32_v_i32m1_m (vbool32_t mask, const int32_t *base, ptrdiff_t bstride, size_t vl);

where I can set the mask to 0xFF and the stride to 1 to read data at 1-byte increments?

  • *this question probably applies to vectorisation generally.* - on x86, you'd use `psadbw` (sum of absolute differences) against a zeroed vector to accumulate sums of 8 bytes without overflow. On AArch64 there's a horizontal sum instruction, IIRC, which may have some hardware support as well as microcoded multiple ops through the pipeline to do the shuffling and summing. I haven't looked at anything for RISC-V extension V, but perhaps it has a multiply-accumulate of bytes into 16-bit elements or something? Like x86's `pmaddwd` or `pmaddubsw` which are useful for this kind of thing. – Peter Cordes Apr 03 '23 at 23:39
  • BTW, I added a link to https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md for documentation on the intrinsics. I have no idea whether that's an authoritative or up-to-date source for RISC-V extension "V" stuff. – Peter Cordes Apr 04 '23 at 01:35
  • RISC-V extension V (according to the official 1.0 spec: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf) has `vwredsumu.vs` to do a widening `+` reduction of unsigned elements. Or if you need to avoid overflow even in the multiply itself, `vwmaccu.vx vd, rs1, vs2, vm` is a Widening unsigned-integer multiply-add. That would produce 16-bit elements, so you couldn't add further without risk of overflow (or `2 * 0xff^2` can already overflow?), so I guess you'd have to widen again with `vwredsumu`. So I think you'd be looking for intrinsics for those. – Peter Cordes Apr 04 '23 at 02:47
  • clang can auto-vectorize a dot-product of unsigned bytes: https://godbolt.org/z/Eezxz6YE4 shows `clang -O3 -march=rv64gv1p0` (extension V 1.0 = v1p0) using `vzext.vf4 v12, v10` on both inputs separately to feed `vmacc.vv` inside the loop with `vredsum.vs` only outside the loop, which is maybe not optimal if it would be possible to use narrower elements longer without overflow. – Peter Cordes Apr 04 '23 at 03:23
  • I think you've got it with `vzext`. The "V" spec section 11.3 has `vzext.vf4 vd, vs2, vm # Zero-extend SEW/4 source to SEW destination`. The Godbolt output implies reading the uint8s into vector registers, extending those vectors, then MAC. It's challenging to find the correct intrinsics to use because of the "casing" involved... There doesn't seem to be an intrinsic to extend and change from unsigned to signed type. Godbolt seems to read in uint8s as int8s and then use `vzext`, which will zero-extend them, negating the effective cast to int8. – confusedandsad Apr 04 '23 at 15:11
  • Perhaps I can literally just cast the vector type from i8 to u8, I will experiment and post results. Edit: you can't cast them, but there are reinterpret cast intrinsics – confusedandsad Apr 04 '23 at 15:12
  • Do vector registers even have a signedness? There are signed vs. unsigned instructions for every case where it matters, like when widening (`vwmaccu` vs. `vwmacc`, just like scalar `mul` vs. `mulu` or `vwredsum[u]`). Same-width integer operations other than division or arithmetic right shift don't care about the meaning of the MSB, unsigned is the same binary operation as 2's complement for add/sub/mul. Or wait, you said intrinsic. I didn't look as much at any intrinsics doc since I wasn't as confident I'd found a current / official one; the vector types have signedness? – Peter Cordes Apr 04 '23 at 18:08
  • Yeah, since I can't use auto-vectorization in my case I have to use the intrinsics to generate the correct assembly. They are a little picky about types and won't let me provide e.g. a vector u8 type to an intrinsic function expecting a vector i8 type. I was also generating `vsext` instructions instead of `vzext`, which I think would change the behaviour. I'm still testing though. Here's where I am currently: https://godbolt.org/z/dr5Ere78e – confusedandsad Apr 04 '23 at 19:24
  • The types shouldn't matter really, like you say, but they do in the API if my understanding is correct. I suppose inline assembly is also an option, as the reinterpretation intrinsics do seem a bit suboptimal here – confusedandsad Apr 04 '23 at 19:27
  • You forgot to enable optimization! No wonder your intrinsics code compiled to total garbage, including insane stuff like `li a0, 6` / `mul reg, reg, a0` instead of a shift. https://godbolt.org/z/e5dEfjqPG shows much more reasonable asm inside the loop. An unfortunate `vsetvli` which the auto-vectorized version avoids, but otherwise similar. IDK why `-fno-vectorize` isn't stopping clang trunk and 16.0 from auto-vectorizing the scalar code, but it's actually useful to have that to compare with. – Peter Cordes Apr 04 '23 at 20:29
  • You're right, much clearer now. Now I just have to debug the code because it doesn't quite work! As I said, I'm new to this. Thanks so much for all your help – confusedandsad Apr 04 '23 at 20:34
  • Does your code work for lengths that are a multiple of 16 or 32? Clang's auto-vectorization just uses scalar for the last few elements that aren't a multiple of the vector width, not taking advantage of RVV's masking. (This is simpler but potentially inefficient if there are many leftover elements, especially for AVX-512 on x86 which also supports masking). Is that what the `vsetvli` is doing inside your main loop, handling the case where this iteration might be a final partial vector? I wonder if that's efficient, or if it's better to do that as a separate final iteration. – Peter Cordes Apr 04 '23 at 20:49
  • Anyway, good luck, and I'd encourage you to post an answer once you figure something out; RVV is very new and there probably aren't any Q&As about it. Probably not even a tag, I'll think about whether [rvv] would be a good idea; short tag names can often collide with other techs, like [sse] often gets mis-tagged on [server-sent-events] questions. But [riscv-v] or [riscv-extension-v] are clunky. For now we have [riscv][simd] as a pair of tags; hmm should probably add [intrinsics] since that's what you're asking about. – Peter Cordes Apr 04 '23 at 20:51
  • It works up to 16 elements. That's what the setvl is *supposed* to be doing, I copied that from one of the rvv examples. Not sure if it's working, and not sure how best to debug at this point. – confusedandsad Apr 04 '23 at 20:54
  • I will post a full answer and example when I get it working! :) – confusedandsad Apr 04 '23 at 20:55
  • I'd suggest single-stepping the asm and seeing what integer value is being passed to `vsetvli`, and check that the pointers are advancing the way you expect for the 2nd 16-byte chunk. And if you can get a debugger to show you the vector register contents after the loads in the 2nd iteration, that could confirm they're getting the values you expect. (Looking at the asm can avoid any potential C source-level confusion about how the intrinsics API is designed. And/or just show you the program logic you actually wrote in other terms, potentially exposing a brain fart.) – Peter Cordes Apr 04 '23 at 20:59

1 Answer


The answer was to extend the width of the vector elements using the appropriate v{s,z}ext intrinsic, then use a reinterpret intrinsic on the result to "cast" its values.

Below is an example of a function and its vectorized equivalent, accounting for width/type changes.

Big thanks to Peter Cordes for helping me figure it out!

int byte_mac(unsigned char a[], unsigned char b[], int len) {
  int sum = 0;
  for (int i = 0; i < len; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

int byte_mac_vec(unsigned char *a, unsigned char *b, int len) {
  size_t vlmax = __riscv_vsetvlmax_e8m1();
  // 32-bit accumulator lanes: each 8-bit m1 vector widens 4x into m4
  vint32m4_t vec_s = __riscv_vmv_v_x_i32m4(0, vlmax);
  vint32m1_t vec_zero = __riscv_vmv_v_x_i32m1(0, vlmax);
  int k = len;
  for (size_t vl; k > 0; k -= vl, a += vl, b += vl) {
    // vl = min(k, vlmax); the last iteration may be a partial vector
    vl = __riscv_vsetvl_e8m1(k);

    vuint8m1_t a8s = __riscv_vle8_v_u8m1(a, vl);
    vuint8m1_t b8s = __riscv_vle8_v_u8m1(b, vl);
    // vzext.vf4: zero-extend each uint8 into its own 32-bit lane
    vuint32m4_t a8s_extended = __riscv_vzext_vf4_u32m4(a8s, vl);
    vuint32m4_t b8s_extended = __riscv_vzext_vf4_u32m4(b8s, vl);

    // Zero-extended values fit in signed 32-bit lanes, so reinterpret
    // (a no-op register "cast") to satisfy the signed vmacc intrinsic
    vint32m4_t a8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(a8s_extended);
    vint32m4_t b8s_as_i32 = __riscv_vreinterpret_v_u32m4_i32m4(b8s_extended);

    // Tail-undisturbed (_tu) MAC: lanes beyond vl keep their earlier sums
    vec_s = __riscv_vmacc_vv_i32m4_tu(vec_s, a8s_as_i32, b8s_as_i32, vl);
  }

  // Horizontal reduction across all accumulated 32-bit lanes
  vint32m1_t vec_sum = __riscv_vredsum_vs_i32m4_i32m1(vec_s, vec_zero, __riscv_vsetvl_e32m4(len));
  int sum = __riscv_vmv_x_s_i32m1_i32(vec_sum);

  return sum;
}
  • So the only change from your earlier attempt in comments was `__riscv_vmacc_vv_i32m4_tu` inside the loop instead of `__riscv_vmacc_vv_i32m4`? What does the `_tu` mean? Tail something, from a quick skim of https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md – Peter Cordes Apr 05 '23 at 02:46
  • It means tail undisturbed. Someone from SiFive told me that it's "so that the upper elements on the last iteration are preserved from previous iterations". I found in a presentation pdf from Barcelona Supercomputing Centre: "When vl < vlmax then we have elements that are not operated • Those elements are called the tail elements RVV offers two policies here • tail undisturbed. Tail elements in the destination register are left unmodified. • tail agnostic. Can behave like tail undisturbed or, alternatively, all the bits of the tail elements of the destination register are set to 1" – confusedandsad Apr 05 '23 at 11:02
  • I'm still hazy on how exactly this works. Trying with/without the `_tu` both seem to be fine. Other changes include setting `vl` to `__riscv_vsetvl_e32m4(len)` in the `redsum` line. This is to handle the case where len < vlmax. Again this was told to me by the SiFive chap, but I think it's to handle running the function with a `len` of e.g. 1. – confusedandsad Apr 05 '23 at 11:04
  • Ok, Tail Undisturbed makes sense here, for a final short vector at the end of a long array. Then you need `redsum` to add all the elements across the whole vector, not just the `vl` you used for the last partial vector. So if you're already setting `vl` inside the loop, setting it again outside would be to handle the case where `len > vlmax` and you had to run multiple iterations. – Peter Cordes Apr 05 '23 at 19:01
  • I just found in rvv-intrinsic-api.md, "Note: Reduction intrinsics will generate code using tail undisturbed policy unless vundefined() is passed to the dest argument." Maybe this is why it made no difference to me with/without the tu. My understanding of tail undisturbed (it's hard to find a clear definition) is that any vector elements beyond `vl` won't be touched by the given operation, other policies may overwrite with 1s. So here in a short vector `vl` bytes are read and `zext`ed, but if there are less than `vl` then because of `_tu` 0s rather than 1s will get propagated to the `vacc`? – confusedandsad Apr 05 '23 at 21:01
  • I don't know how loads work, but that doesn't really matter (other than false dependencies and having to merge) if `vl` counts in elements not bytes. The `vmacc` will multiply+add into the low `vl` elements of `vec_s`, leaving the higher elements of `vec_s` unmodified, still holding the sums from earlier iterations when `vl` was higher. That's the "tail", and it's staying unmodified. It doesn't matter what happened in the high parts of the load and zext temporary results because we're not keeping those, masking the MAC operation makes them not affect it. – Peter Cordes Apr 05 '23 at 21:08