
I am trying to add 50 to every element of a 2D array using NEON intrinsics. Here is my code. Is there a better way of doing it, or of optimizing it?

void fun(int height, int width, unsigned char array2D[][width], unsigned char *output)
{
    uint8x16_t va, vb, res;
    vb = vdupq_n_u8((unsigned char)50);
    unsigned char *arr = &array2D[0][0]; // input array
    int size = height * width;
    for (int i = 0; i < size; i += 16) {
        va = vld1q_u8(arr + i);
        res = vaddq_u8(va, vb);
        vst1q_u8(output + i, res);
    }
}
  • There are tons of improvements possible. Why don't you try to read the disassembly? I suggest studying computer architecture (dependencies, latency, etc.) before digging into SIMD. – Jake 'Alquimista' LEE Mar 21 '22 at 11:10
  • Not much can be further improved. The problem is *memory bound* anyway. Maybe process 4 vectors rather than 1 to handle a whole cache line at once (may give a 10-20% speedup). Consider using OpenMP: add `#pragma omp parallel for` before the `for` loop and `-fopenmp` to the compiler command line – tstanisl Mar 21 '22 at 12:21
  • Don't forget tail code to cope with the case where height*width is not known to be a multiple of 16 – BenClark Mar 30 '22 at 11:01
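Following up on the tail-code comment, here is a minimal sketch (function name and parameter order are my own; the `__ARM_NEON` guard and scalar fallback are added so it also builds on non-ARM targets) of the same loop with a scalar tail for sizes that are not a multiple of 16:

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

void add50(int height, int width, const unsigned char *arr, unsigned char *output)
{
    int size = height * width;
    int i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vb = vdupq_n_u8(50);
    /* vector part: 16 elements per iteration */
    for (; i <= size - 16; i += 16) {
        uint8x16_t va = vld1q_u8(arr + i);
        vst1q_u8(output + i, vaddq_u8(va, vb));
    }
#endif
    /* scalar tail (and portable fallback) for the remaining elements */
    for (; i < size; ++i)
        output[i] = (unsigned char)(arr[i] + 50);
}
```

Per tstanisl's comment, a `#pragma omp parallel for` could additionally be placed before the vector loop (compiled with `-fopenmp`), though the tail loop should then use an index computed from `size` rather than the loop variable.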

1 Answer


As @tstanisl said, the operation is memory bound, so not much can be improved further. Still, there are some approaches worth trying:

  1. Unroll the loop by a larger factor, processing as many vectors as possible per iteration, up to the point where registers start spilling.
  2. Use OpenMP to parallelize the loop if you have more CPU cores available.
  3. Use `pld` to prefetch the input array (`__builtin_prefetch()` as an intrinsic), and choose a suitable prefetch offset carefully and empirically.
  4. Remember to turn on the compiler option `-mcpu=cortex-xx` if you know exactly which target you are compiling for.
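Points 1 and 3 above can be sketched as follows (function name and the prefetch distance are my own guesses; the offset in particular must be tuned empirically on the target core, and the `__ARM_NEON` guard plus scalar fallback keep the sketch buildable elsewhere):

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Prefetch distance in bytes: an assumption, tune per core. */
#define PREFETCH_OFFSET 256

void add50_unrolled(const unsigned char *arr, unsigned char *output, int size)
{
    int i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vb = vdupq_n_u8(50);
    /* 4x unrolled: 64 bytes (a typical cache line) per iteration */
    for (; i <= size - 64; i += 64) {
        __builtin_prefetch(arr + i + PREFETCH_OFFSET);
        uint8x16_t v0 = vld1q_u8(arr + i);
        uint8x16_t v1 = vld1q_u8(arr + i + 16);
        uint8x16_t v2 = vld1q_u8(arr + i + 32);
        uint8x16_t v3 = vld1q_u8(arr + i + 48);
        vst1q_u8(output + i,      vaddq_u8(v0, vb));
        vst1q_u8(output + i + 16, vaddq_u8(v1, vb));
        vst1q_u8(output + i + 32, vaddq_u8(v2, vb));
        vst1q_u8(output + i + 48, vaddq_u8(v3, vb));
    }
#endif
    /* scalar tail (and portable fallback) */
    for (; i < size; ++i)
        output[i] = (unsigned char)(arr[i] + 50);
}
```

Loading four independent vectors before storing also gives the core more memory-level parallelism than the one-vector loop, which is what the 10-20% estimate in the comments refers to.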