How to optimize the computation of a for loop using SIMD?

Question

I am trying to accelerate a stereo matching algorithm on ODROID XU4 ARM platform using Neon SIMD. For this puropose I am using openMp's pragmas.

 void StereoMatch:: sadCol(uint8_t* leftRank,uint8_t* rightRank,const int SAD_WIDTH,const int SAD_WIDTH_STEP, const int imgWidth,int j, int d , uint16_t* cost) 
  {

   uint16_t sum = 0;
   int n = 0;
   int m =0;
      for ( n = 0; n < SAD_WIDTH+1; n++)
      {

     #pragma omp simd
     for(  m = 0; m< SAD_WIDTH_STEP; m = m + imgWidth ) 
         {


        sum += abs(leftRank[j+m+n]-rightRank[j+m+n-d]);

         };
         cost[n] = sum;
         sum = 0;



  };

I am fairly new to SIMD and openMp, I understood that using the SIMD pragma in the code will direct the compiler to vectorize the subtraction, but when I executed the code I noticed no difference. What should I add to my code in order to vectorize it ?

What flags are you compiling with? Try `-O3 -fopenmp-simd -march=native -mfpu=neon` — nemequ, May 09 '19 at 22:12
In your particular case, you will also need to use `reduction(+:sum)` with the `simd` pragma. — Michael Klemm, May 11 '19 at 11:50
I have no actual experience with Neon, but SIMD is usually not good with strided memory access. I would switch the order of the loops, i.e., accumulate to multiple `cost[n]` in the inner loop. What are typical values for `SAD_WIDTH`, `SAD_WIDTH_STEP` and `imgWidth`? — chtz, May 12 '19 at 09:07
Also, for asking here, simplify your code to something that compiles on its own (a [mcve]). No need to make a member function instead of a free function, and the `j` and `d` parameters are not really important (in fact, you could add/subtract them to `leftRank`, `rightRank` before calling your function). — chtz, May 12 '19 at 09:13
Sorry I couldn't answer back so soon @nemequ. I use the flags you mentioned above. How can I know that vectorization happened successfully ? — Taki Eddine, May 13 '19 at 07:30
typical values for SAD_WIDTH = 7, imgWidth = 320, SAD_WIDTH_STEP = SAD_WIDTH* imgWidth. I used the latter one just to avoid repetitive multiplication — Taki Eddine, May 13 '19 at 07:49
For `SAD_WIDTH=7`, the `cost` array would have exactly size 8, i.e., fit into one `uint16x8_t` register. And Neon seems to have an instruction for exactly what you need: absolute difference and accumulate: `VABAL.U8` which has the intrinsic `vabal_u8` http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0472m/chr1360928375079.html. If this is performance critical, I suggest just writing this with intrinsics (with some `#ifdef` guards for portability). If `SAD_WIDTH+1` is not a multiple of 8, I'd just calculate some additional values but ignore them before storing the end-result. — chtz, May 13 '19 at 14:11
@chtz It seems a good idea , my question is how should I compile using gcc, I tried -O3 but it didn't compile at all — Taki Eddine, May 13 '19 at 19:14
`-ftree-vectorizer-verbose=7` should give you some information, or you could just look at the disassembly. What do you mean `-O3` didn't work? Note that without `-O3` you won't have `-ftree-vectorize`, so its probably not vectorized. — nemequ, May 13 '19 at 21:55
Here is a cleaned up version on godbolt: https://godbolt.org/z/SPQ5_t. You may need to optimize this by hand using intrinsics, but you could try using pointers for the inner loop; that `-d` is definitely a good candidate for messing things up. — nemequ, May 13 '19 at 22:16
I think it worked it I was mistakenly executing the WRONG output file, vectorization has accelerated the execution of my algorithm 4 times ! — Taki Eddine, May 14 '19 at 08:34
@TakiEddine I don't think the acceleration you are experiencing is thanks to SIMD, but OpenMP doing its job distributing the workload through multiple cores. When properly written, it will be dozens of times as fast as the original one even on a single core. I wouldn't use multiple cores for such a trivial job that's bandwith limited anyway. — Jake 'Alquimista' LEE, Jun 06 '19 at 07:36
@nemequ sorry for taking this long to respond, why do you think that the -d can mess things up ? — Taki Eddine, Feb 02 '20 at 15:56

score 1 · Accepted Answer · answered Jul 17 '19 at 15:53

As said in the comments, ARM-Neon has an instruction which directly does what you want, i.e., compute the absolute difference of unsigned bytes and accumulates it to unsigned short-integers.

Assuming SAD_WIDTH+1==8, here is a very simple implementation using intrinsics (based on the simplified version by @nemequ):

void sadCol(uint8_t* leftRank,
            uint8_t* rightRank,
            int j,
            int d ,
            uint16_t* cost) {
    const int SAD_WIDTH = 7;
    const int imgWidth = 320;
    const int SAD_WIDTH_STEP = SAD_WIDTH * imgWidth;

    uint16x8_t cost_8 = {0};
    for(int m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth )  {
        cost_8 = vabal_u8(cost_8, vld1_u8(&leftRank[j+m]), vld1_u8(&rightRank[j+m-d]));
    };
    vst1q_u16(cost, cost_8);
};

vld1_u8 loads 8 consecutive bytes, vabal_u8 computes the absolute difference and accumulates it to the first register. Finally, vst1q_u16 stores the register to memory.

You can easily make imgWidth and SAD_WIDTH_STEP function parameters. If SAD_WIDTH+1 is a different multiple of 8, you can write another loop for that.

I have no ARM platform at hand to test it, but "it compiles": https://godbolt.org/z/vPqiYI (and the assembly looks fine, in my eyes). If you optimize with -O3 gcc will unroll the loop.

How to optimize the computation of a for loop using SIMD?

1 Answers1