
I am trying to add 50 to every element of a 2D array using NEON intrinsics. Here is my code. Is there a better way of doing it, or of optimizing it?

void fun(int height, int width, unsigned char array2D[][width], unsigned char *output)
{
    uint8x16_t va, vb, res;
    vb = vdupq_n_u8((unsigned char)50);
    unsigned char *arr = &array2D[0][0]; // input array
    int size = height * width;
    for (int i = 0; i < size; i += 16) {
        va = vld1q_u8(arr + i);
        res = vaddq_u8(va, vb);
        vst1q_u8(output + i, res);
    }
}
  • There are tons of improvements possible. Why don't you try to read the disassembly? I suggest studying computer architecture (dependencies, latency, etc.) before digging into SIMD. – Jake 'Alquimista' LEE Mar 21 '22 at 11:10
  • Not much can be further improved. The problem is *memory bound* anyway. Maybe process 4 vectors rather than 1 to handle a whole cache line at once (may give a 10-20% speedup). Consider using OpenMP: add `#pragma omp parallel for` before the `for` loop and `-fopenmp` to the compiler command line – tstanisl Mar 21 '22 at 12:21
  • Don't forget tail code to cope with the case where height*width is not known to be a multiple of 16 – BenClark Mar 30 '22 at 11:01
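Following up on the tail-code comment, here is a minimal sketch (function name and parameter order are my own; the `__ARM_NEON` guard and scalar fallback are added so it also builds on non-ARM targets) of the same loop with a scalar tail for sizes that are not a multiple of 16:

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

void add50(int height, int width, const unsigned char *arr, unsigned char *output)
{
    int size = height * width;
    int i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vb = vdupq_n_u8(50);
    /* vector part: 16 elements per iteration */
    for (; i <= size - 16; i += 16) {
        uint8x16_t va = vld1q_u8(arr + i);
        vst1q_u8(output + i, vaddq_u8(va, vb));
    }
#endif
    /* scalar tail (and portable fallback) for the remaining elements */
    for (; i < size; ++i)
        output[i] = (unsigned char)(arr[i] + 50);
}
```

Per tstanisl's comment, a `#pragma omp parallel for` could additionally be placed before the vector loop (compiled with `-fopenmp`), though the tail loop should then use an index computed from `size` rather than the loop variable.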

1 Answer


As @tstanisl said, the operation is memory bound, so not much can be improved further. Still, there are some approaches worth trying:

  1. Unroll the loop by a larger factor, processing as many vectors as possible per iteration, up to the point where registers start spilling.
  2. Use OpenMP to parallelize the loop if you have more CPU cores available.
  3. Use `pld` to prefetch the input array (`__builtin_prefetch()` as an intrinsic), and choose a suitable prefetch offset carefully and empirically.
  4. Remember to turn on the compiler option `-mcpu=cortex-xx` if you know exactly which target you are compiling for.
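Points 1 and 3 above can be sketched as follows (function name and the prefetch distance are my own guesses; the offset in particular must be tuned empirically on the target core, and the `__ARM_NEON` guard plus scalar fallback keep the sketch buildable elsewhere):

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Prefetch distance in bytes: an assumption, tune per core. */
#define PREFETCH_OFFSET 256

void add50_unrolled(const unsigned char *arr, unsigned char *output, int size)
{
    int i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vb = vdupq_n_u8(50);
    /* 4x unrolled: 64 bytes (a typical cache line) per iteration */
    for (; i <= size - 64; i += 64) {
        __builtin_prefetch(arr + i + PREFETCH_OFFSET);
        uint8x16_t v0 = vld1q_u8(arr + i);
        uint8x16_t v1 = vld1q_u8(arr + i + 16);
        uint8x16_t v2 = vld1q_u8(arr + i + 32);
        uint8x16_t v3 = vld1q_u8(arr + i + 48);
        vst1q_u8(output + i,      vaddq_u8(v0, vb));
        vst1q_u8(output + i + 16, vaddq_u8(v1, vb));
        vst1q_u8(output + i + 32, vaddq_u8(v2, vb));
        vst1q_u8(output + i + 48, vaddq_u8(v3, vb));
    }
#endif
    /* scalar tail (and portable fallback) */
    for (; i < size; ++i)
        output[i] = (unsigned char)(arr[i] + 50);
}
```

Loading four independent vectors before storing also gives the core more memory-level parallelism than the one-vector loop, which is what the 10-20% estimate in the comments refers to.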