
I am trying to speed up a simple nested loop:

for (int k = 0; k < n; k++)
    for (int i = 0; i < n - k; ++i)
      c[k] += a[i + k] * b[i];

First I tried OpenMP. Since this loop is not well balanced (the inner trip count shrinks as k grows), I added a small modification that pairs iteration k with iteration n-1-k:

#pragma omp parallel for
  for (int k = 0; k < n/2; k++)
    for (int i = 0; i < n - k; ++i){
      c[k] += a[i + k] * b[i];
      if(i < k+1) c[n-1-k] += a[i + n-1-k] * b[i];
    }
for(int k = n/2; k < n - n/2; k++)
  for (int i = 0; i < n - k; ++i)
    c[k] += a[i + k] * b[i];

But the problem is that this is slower than simply adding #pragma omp parallel for to the original loop. So I guessed it was probably related to cache reuse, and then tried unrolling:

#pragma omp parallel for
    for (k = 0; k < n/2-7; k+=8){
        for (int i = 0; i < n - k; ++i){
            c[k] += a[i+k] * b[i];
            if(i < n-k-1)   c[k+1] += a[i+k+1] * b[i];
            if(i < n-k-2)   c[k+2] += a[i+k+2] * b[i];
            if(i < n-k-3)   c[k+3] += a[i+k+3] * b[i];
            if(i < n-k-4)   c[k+4] += a[i+k+4] * b[i];
            if(i < n-k-5)   c[k+5] += a[i+k+5] * b[i];
            if(i < n-k-6)   c[k+6] += a[i+k+6] * b[i];
            if(i < n-k-7)   c[k+7] += a[i+k+7] * b[i];
            if(i < k+1)    c[n-1-k] += a[i+n-1-k] * b[i];
            if(i < k+2)    c[n-2-k] += a[i+n-2-k] * b[i];
            if(i < k+3)    c[n-3-k] += a[i+n-3-k] * b[i];
            if(i < k+4)    c[n-4-k] += a[i+n-4-k] * b[i];
            if(i < k+5)    c[n-5-k] += a[i+n-5-k] * b[i];
            if(i < k+6)    c[n-6-k] += a[i+n-6-k] * b[i];
            if(i < k+7)    c[n-7-k] += a[i+n-7-k] * b[i];
            if(i < k+8)    c[n-8-k] += a[i+n-8-k] * b[i];
        }

    }
    // the remainder loop has at most 16 iterations and is well balanced
    #pragma omp parallel for
    for(int j = k; j < n-k; j++)
        for(int i = 0; i < n - j; ++i){
            c[j] += a[i + j] * b[i];
        }

But... it gets even worse! I just want to know why.

More info: I compiled it with g++-9 test.cpp -openmp -o test

Cino
  • The first modification you made seems to apply the work sharing only to the first k-loop (the one with `k < n/2`). The other one is not considered for work sharing and is outside the `#pragma omp for` construct. – User 10482 May 30 '20 at 05:50
  • Yes, only the outer loop is shared, but I think the second one is not outside the pragma. What's more, with `#pragma omp parallel for for (int k = 0; k < n; k++) for (int i = 0; i < n - k; ++i) c[k] += a[i + k] * b[i];` the performance does improve. – Cino May 30 '20 at 05:57
  • It does seem so. The `#pragma omp for` applies to the first k-loop block only. Try adding another `#pragma omp for` for the next k-loop as well and check. Otherwise, you have half the computation parallelized and half serial. – User 10482 May 30 '20 at 06:03
  • Actually, I've tried this, but it helps only a little (still worse than plain `#pragma omp for`). – Cino May 30 '20 at 06:23
  • May I suggest playing with the `schedule` clause rather than splitting the loop into parts? It would be hard for you to predict proper load balancing unless the load pattern is very obvious. Side note: the condition in the second k-loop is still `n - n/2 = n/2`. I don't think that's what you intended. – User 10482 May 30 '20 at 06:28
  • LOL, you solved my first question, thanks so much! Could you tell me more details about it? Why can't I schedule it myself? And if I still unroll the loop, can I get better performance? – Cino May 30 '20 at 07:00
  • How big is `n` for a typical execution of the program? And how are you timing execution? – High Performance Mark May 30 '20 at 07:06
  • I've tested it several times. Approximately n = 10^5, and with `schedule` the speedup is about 1.8x (the benchmark is `#pragma omp for`). – Cino May 30 '20 at 07:11
  • I recommend using an optimization compiler flag, `-O3` or `-O2`, when testing performance. – RHertel May 30 '20 at 07:21
  • Yeah, I added `-O2`. But I do not want to use `-O3` because it is too aggressive. – Cino May 30 '20 at 07:26
