For loop optimization in c

Question

how can I optimize this for loop:

for (l = 1; l <= loop; l++) {
    for (i = 1; i < n; i++) {
        x[i] = z[i] * (y[i] - x[i - 1]);
    }
}

and how can I parallel original and optimized version of it by OpenMp?

Possibly duplicate of [algorithm-to-optimize-nested-loops](https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops) — ntshetty, Apr 12 '18 at 05:37
@ThiruShetty: No.it is deferent : optimization and parallelization — AComputer, Apr 12 '18 at 05:44
This can be rewritten as a prefix/cumulative sum. I only know how to scale the prefix sum with with the number of memory controllers (NUMA), not the number of cores https://en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithm. — Z boson, Apr 13 '18 at 08:29
It could be interesting to look into this again with GPU offloading. — Z boson, Apr 13 '18 at 08:33

dvhh · Accepted Answer · 2018-04-12T05:53:29.637

4

Assuming you want to parallelize the inner loop

for ( i = 1; i < n; ++i ) {
    x[i] = z[i] * ( y[i] - x[i - 1] );
}

I would suggest pre-computing the part that are not dependent on the previous loop. which is easier to parallelize.

double preComps [n];
#pragma omp parallel for
for( i = 1; i < n ; ++i ) {
    preComps[i] = z[i] * y[i];
}

// this loop is difficult to parallelize because of the data dependency on what was computed in the previous loop
for( i = 1; i < n ; ++i ) {
    x[i] = preComps[i] - z[i] * x[i - 1];
}

edited Apr 12 '18 at 05:53

answered Apr 12 '18 at 05:43

dvhh

4,724
27
33

As far as I can see you just made it more complex. The critical loop has the same complexity as before but you added another loop that maybe parallelized. Resulting in more memory usage and more computations. – Kami Kaze Apr 12 '18 at 06:42
The whole process is (very likely to be) memory bound. Even if your pre-computation of some indexes in parallel were to be sped-up by its parallelization (which is yet to be proven), the number of memory references in the actual loops being equal as before, this one will be as costly as the initial one. All in all, your method will take more time and more resources than the initial. – Gilles Apr 12 '18 at 08:32
@KamiKaze, in order to save time, you _do have to parallelize_ as much as possible. The aproach leads to a win in almost 100% more speed, as seen in my answer. I'ts more compex, of course... but it saves time (50% approx) – Luis Colorado Apr 14 '18 at 11:49
@LuisColorado Given but only if there is something to parallelize, this does not change the critical path at all, save for a possible MAC operation. See what Gilles said. – Kami Kaze Apr 16 '18 at 06:52
The critical path (as shown in my answer) allows for a two thread of parallel calculations, so you get a 50% time savins for the whole thing. Not considering that the outer loop can be completely eliminated in almost all cases. – Luis Colorado Apr 16 '18 at 08:00

Luis Colorado · Answer 2 · 2018-04-16T11:28:58.677

OUTER LOOP (no aliasing assumed, see below for aliasing permitted)

As you make no references to the outer loop control variable l and you don't even reference the same terms of the assignment and no lateral effects are made in the inner loop, running the inner loop is idempotent, there's no benefit of running it once or more than once, so a very good optimisation is to eliminate it completely :), as the following example shows:

pru.c

00001: #include <stdio.h>
00002: #include <stdlib.h>
00003: #define n 10
00004: #define loop 30


00005: void print(int x[], int y[], int z[])
00006: {
00007:      int i;
00008:      printf("%12s%12s%12s%12s\n","i", "x[]", "y[]", "z[]");
00009:      for(i = 0; i < n; i++)
00010:              printf("%12d%12d%12d%12d\n", i, x[i], y[i], z[i]);
00011: }
00012: int main()
00013: {
00014:      int x[n], y[n], z[n];
00015:      int i, l;
00016:      for(i = 0; i < n; i++) {
00017:              x[i] = rand();
00018:              y[i] = rand();
00019:              z[i] = rand();
00020:      }

print t the beginning

00021:      print(x, y, z);

next is the posted loop:

00022:      for (l = 1; l <= loop; l++) {
00023:              printf("iteration %d\n", l);

00024:         for (i = 1; i<n; i++) {
00025:              x[i] = z[i] * (y[i] - x[i - 1]);
00026:         }

and print it

00027:              print(x, y, z);

00028:      }

end of posted loop

00029: }

as you see, no difference of array contents between passes on the loop. Next is a run of the program to demonstrate:

initial contents:

$ a.out
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1  1969887315   140734212   940422543
           2   202055087   768218108   770072198
           3  1866991770  1647128879    83392682
           4  1421485336   148486083   229615973
           5   127561358   735081006    33063457
           6  1646757679   287085223  1793088605
           7   802182690   382151770  1848710666
           8  1486775472   115658218   394986197
           9   661076908  1786703631   864107022

first iteration:

iteration 1
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
iteration 2
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
iteration 3
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022

... and the iterations repeat until

iteration 30
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
$ _

INNER LOOP

if you reorder the inner expression, you can get some benefit in the inner loop also, as

x[0]
 \----.
      |
x[1] <+- y[1], z[1]
  \---.
      |
x[2] <+- y[2], z[2]
  .
  .
  .
x[n-1]<+- y[n-1],z[n-1]
   \--.
      |
x[n] <+- y[n], z[n]

If you rearrange the expression as x[i] = z[i]*y[i] - z[i]*x[i-1], you can parallelize, all the calculations of z[i]*y[i], and also the calculation of z[i]*x[i-1] as soon as the value of x[i-1] is calculated, gaining more time in the calculation of the inner loop.

 thrd[0]   thrd[1]    thrd[2]      ... thrd[j]   ...  thrd[n]
============================================================
z[1]*x[0]  z[1]*y[1]     z[2]*y[2] ... z[j]*y[j] ... z[n-1]*y[n-1]
    |          |             |             |                |        
    \----------+-------.     |             |                |
           ,---'       |     |             |                |
           |           |     |             |                |
           V           V     |             |                |
x[1] = z[1]*y[1] - z[1]*x[0] |             |                |
  |                          |             |                |
  `--------------------.     |             |                |
                       |     |             |                |
           ,-----------+-----'             |                |
           |           |                   |                |
           V           V                   |                |
x[2] = z[2]*y[2] - z[2]*x[1]               |                |
  |                                        |                |
  `--------------------.                   |                |
           ,-----------+-------------------'                |
           |           |                                    |
          ...         ...                                   |
           V           V                                    |
x[j] = z[j]*y[j] - z[j]*x[j-1]                              |
...                                                         |                    
 |                                                          |
 `---------------------------.                              |
                             |                              |
               ,-------------+------------------------------'
               |             |
               V             V
x[n-1] = z[n-1]*y[n-1]-z[n-1]*x[n-2]

this can be efficiently calculated with a pool of two threads. Previously you had n-1 products and n-1 subtractions, now you have 2*n products and n-1 subtractions, calculated in parallel, so no savings in the end you get from this approach (and you get two threads working, thanks to KamiKaze that showed me the mistake)

considering ALIASING

As can be seen from the previous graph, calculations of inner loop only depend on x[0], y[0...n-1] and z[0...n-1], and the only dependence of crossed values is given by the expression x[1] = f(x[0],z[1],y[1]). If you check... if we alias x with z or with y, then the expression transforms into x[j] = f(x[j-1],x[j], y[j]) or x[j] = f(x[j-1],z[j],x[j]), and it makes the value of x[j] in general to depend on the previous value of x[j]. In those cases (x aliased with y or z, or both) the algorithm is not idempotent, and the external loop cannot be eliminated. In the case of only aliasing y with z the expression is x[j] = f(x[j-1], y[j]) (or x[j] = f(x[j-1], z[j])) so no dependency exist on previous values, and the algorithm is idempotent.

So, in conclusion, in case of allowing aliasing between x vector and any of y or z, the outer loop cannot be eliminated, and must be conserved. In case of aliasing of y and z the algorithm continues to be idempotent, and the outer loop is not necessary.

As implied here, if the outer loop simply repeated the inner loop calculation (not clear that this is so), a compiler could cut the outer loop down to a single iteration, but with openmp parallelization such an optimization might be prevented. — tim18, Apr 13 '18 at 11:48
@tim18, right! you can get even a worse resource waste, making several processors busy calculate a thing that is idempotent. — Luis Colorado, Apr 14 '18 at 10:41
please don't post code with line numbers it makes it tedious to try your solution. — Kami Kaze, Apr 16 '18 at 06:18
your code might produce undefined behaviour. Multiplying two intergers in the range of integer might lead to a value greater than int. — Kami Kaze, Apr 16 '18 at 06:35
Your inner loop still has the same critical path as before. You have to wait for the previous iteration for the next calculation. You might have the result of your precalculated multiplacation but you still have to multiply and substract in the critical path which yields no benefit. The exception would be an MAC operation(see my answer) — Kami Kaze, Apr 16 '18 at 06:48
Sorry, next time i'll post it with linenumbers commented... so you are happy. — Luis Colorado, Apr 16 '18 at 08:04
@KamiKaze, I used random numbers to illustrate that there's no difference in using the outer loop. Of course, some of the operations can overflow with this approach, leading to U.B. But I think seriously I you only try to pursue my solution or to aport some positive critic to the code. And I'm in doubt at the end. Do you feel bad as not having been the first to provide it? In a production environment, I see no use on filling three arrays of integers just to multiply them many times with no final sense. — Luis Colorado, Apr 16 '18 at 08:07
And integer overflow does not lead to U.B. (even signed) but implementation defined behaviour (which is different) — Luis Colorado, Apr 16 '18 at 08:10
I do not feel bad about, not seeing the outer loop as redundant ( I credited it to you in my answer). If you think the first two comments were out of vengence your assesment was wrong. The line number thing bugged me because it made it hard for me to copy your code to try it and it is a custom here to not do so afaik. The UB (see c99 3.4.3 http://port70.net/~nsz/c/c11/n1570.html#3.4.3p3) as just an information and no critique on the validity of the answer. My only problem about this answer is that the critical path is still the same length so there is no speedup. — Kami Kaze, Apr 16 '18 at 08:46
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169059/discussion-between-kami-kaze-and-luis-colorado). — Kami Kaze, Apr 16 '18 at 08:51

Kami Kaze · Answer 3 · 2018-04-16T08:51:25.230

0

The outer loop is just noise, it does the same all the time, as Luis stated in his answer. So we don't have to consider this (or can massivly parallize it, but doing it once would be enough)....

The inner loop depends on the previous cycle(x[i-1]).

A precomputation of z * y changing the inner loop to x[i] = preComps[i] - z[i] * x[i - 1]; (as proposed by dvhh) might yield a benefit, when a Multiply-Accumulate instruction is used, but this is implementation-specific and I don't know the gain of this.

If there are 0s in z[] then there would be a possibility to parallelize. Because for z[i] = 0 -> x[i] = 0 giving you the possibility to split the inner loop at this point as

x[i+1] will always be equal y[i+1] * z[i+1] which is known at any moment. Giving you an entry point for another loop.

edited Apr 16 '18 at 08:51

answered Apr 12 '18 at 07:14

Kami Kaze

2,069
15
27

Particularly in the original version, a declaration like float *__restrict x to assert no aliasing between x[] and y[] and z[] could help by permitting the loads of y and z to occur in parallel with computation of previous x values. As previously asserted, there is no evident value in openmp here. – tim18 Apr 12 '18 at 11:21
@tim18, I have to check this, but the calculations on the inner loop depend only of the values of previous array values (for array `x` only), and neither `x[0]`, `x[0]`, or `z[0]` are touched by the inner loop... So, probably, even in the case of aliasing, running the inner loop twice has no effect on the output. But, at least for the moment, that's only a hypothesis. I guess that only in the case of array overlapping, you can get a difference. – Luis Colorado Apr 14 '18 at 12:04
@LuisColorado You are right about the outer loop I misjudged the impact of `x[i-1]`. `x[0]` never changes so the starting point is the same and all values in `x` except `x[0]` are not evaluated. (rewrote the comment as it had a few mistaktes) – Kami Kaze Apr 16 '18 at 08:50

For loop optimization in c

3 Answers3

OUTER LOOP (no aliasing assumed, see below for aliasing permitted)

pru.c

INNER LOOP

considering ALIASING