how can I optimize this for loop:
for (l = 1; l <= loop; l++) {
for (i = 1; i < n; i++) {
x[i] = z[i] * (y[i] - x[i - 1]);
}
}
and how can I parallel original and optimized version of it by OpenMp
?
how can I optimize this for loop:
for (l = 1; l <= loop; l++) {
for (i = 1; i < n; i++) {
x[i] = z[i] * (y[i] - x[i - 1]);
}
}
and how can I parallel original and optimized version of it by OpenMp
?
Assuming you want to parallelize the inner loop
for ( i = 1; i < n; ++i ) {
x[i] = z[i] * ( y[i] - x[i - 1] );
}
I would suggest pre-computing the part that are not dependent on the previous loop. which is easier to parallelize.
double preComps [n];
#pragma omp parallel for
for( i = 1; i < n ; ++i ) {
preComps[i] = z[i] * y[i];
}
// this loop is difficult to parallelize because of the data dependency on what was computed in the previous loop
for( i = 1; i < n ; ++i ) {
x[i] = preComps[i] - z[i] * x[i - 1];
}
As you make no references to the outer loop control variable l
and you don't even reference the same terms of the assignment and no lateral effects are made in the inner loop, running the inner loop is idempotent, there's no benefit of running it once or more than once, so a very good optimisation is to eliminate it completely :)
, as the following example shows:
00001: #include <stdio.h>
00002: #include <stdlib.h>
00003: #define n 10
00004: #define loop 30
00005: void print(int x[], int y[], int z[])
00006: {
00007: int i;
00008: printf("%12s%12s%12s%12s\n","i", "x[]", "y[]", "z[]");
00009: for(i = 0; i < n; i++)
00010: printf("%12d%12d%12d%12d\n", i, x[i], y[i], z[i]);
00011: }
00012: int main()
00013: {
00014: int x[n], y[n], z[n];
00015: int i, l;
00016: for(i = 0; i < n; i++) {
00017: x[i] = rand();
00018: y[i] = rand();
00019: z[i] = rand();
00020: }
print t the beginning
00021: print(x, y, z);
next is the posted loop:
00022: for (l = 1; l <= loop; l++) {
00023: printf("iteration %d\n", l);
00024: for (i = 1; i<n; i++) {
00025: x[i] = z[i] * (y[i] - x[i - 1]);
00026: }
and print it
00027: print(x, y, z);
00028: }
end of posted loop
00029: }
as you see, no difference of array contents between passes on the loop. Next is a run of the program to demonstrate:
initial contents:
$ a.out
i x[] y[] z[]
0 33613 564950497 1097816498
1 1969887315 140734212 940422543
2 202055087 768218108 770072198
3 1866991770 1647128879 83392682
4 1421485336 148486083 229615973
5 127561358 735081006 33063457
6 1646757679 287085223 1793088605
7 802182690 382151770 1848710666
8 1486775472 115658218 394986197
9 661076908 1786703631 864107022
first iteration:
iteration 1
i x[] y[] z[]
0 33613 564950497 1097816498
1 -1607135687 140734212 940422543
2 1213242898 768218108 770072198
3 -1987622590 1647128879 83392682
4 -1113079323 148486083 229615973
5 -327431319 735081006 33063457
6 407021958 287085223 1793088605
7 1996444744 382151770 1848710666
8 500660170 115658218 394986197
9 -84727866 1786703631 864107022
iteration 2
i x[] y[] z[]
0 33613 564950497 1097816498
1 -1607135687 140734212 940422543
2 1213242898 768218108 770072198
3 -1987622590 1647128879 83392682
4 -1113079323 148486083 229615973
5 -327431319 735081006 33063457
6 407021958 287085223 1793088605
7 1996444744 382151770 1848710666
8 500660170 115658218 394986197
9 -84727866 1786703631 864107022
iteration 3
i x[] y[] z[]
0 33613 564950497 1097816498
1 -1607135687 140734212 940422543
2 1213242898 768218108 770072198
3 -1987622590 1647128879 83392682
4 -1113079323 148486083 229615973
5 -327431319 735081006 33063457
6 407021958 287085223 1793088605
7 1996444744 382151770 1848710666
8 500660170 115658218 394986197
9 -84727866 1786703631 864107022
... and the iterations repeat until
iteration 30
i x[] y[] z[]
0 33613 564950497 1097816498
1 -1607135687 140734212 940422543
2 1213242898 768218108 770072198
3 -1987622590 1647128879 83392682
4 -1113079323 148486083 229615973
5 -327431319 735081006 33063457
6 407021958 287085223 1793088605
7 1996444744 382151770 1848710666
8 500660170 115658218 394986197
9 -84727866 1786703631 864107022
$ _
if you reorder the inner expression, you can get some benefit in the inner loop also, as
x[0]
\----.
|
x[1] <+- y[1], z[1]
\---.
|
x[2] <+- y[2], z[2]
.
.
.
x[n-1]<+- y[n-1],z[n-1]
\--.
|
x[n] <+- y[n], z[n]
If you rearrange the expression as x[i] = z[i]*y[i] - z[i]*x[i-1]
, you can parallelize, all the calculations of z[i]*y[i]
, and also the calculation of z[i]*x[i-1]
as soon as the value of x[i-1]
is calculated, gaining more time in the calculation of the inner loop.
thrd[0] thrd[1] thrd[2] ... thrd[j] ... thrd[n]
============================================================
z[1]*x[0] z[1]*y[1] z[2]*y[2] ... z[j]*y[j] ... z[n-1]*y[n-1]
| | | | |
\----------+-------. | | |
,---' | | | |
| | | | |
V V | | |
x[1] = z[1]*y[1] - z[1]*x[0] | | |
| | | |
`--------------------. | | |
| | | |
,-----------+-----' | |
| | | |
V V | |
x[2] = z[2]*y[2] - z[2]*x[1] | |
| | |
`--------------------. | |
,-----------+-------------------' |
| | |
... ... |
V V |
x[j] = z[j]*y[j] - z[j]*x[j-1] |
... |
| |
`---------------------------. |
| |
,-------------+------------------------------'
| |
V V
x[n-1] = z[n-1]*y[n-1]-z[n-1]*x[n-2]
this can be efficiently calculated with a pool of two threads. Previously you had n-1
products and n-1
subtractions, now you have 2*n
products and n-1
subtractions, calculated in parallel, so no savings in the end you get from this approach (and you get two threads working, thanks to KamiKaze that showed me the mistake)
As can be seen from the previous graph, calculations of inner loop only depend on x[0]
, y[0...n-1]
and z[0...n-1]
, and the only dependence of crossed values is given by the expression x[1] = f(x[0],z[1],y[1])
. If you check... if we alias x
with z
or with y
, then the expression transforms into x[j] = f(x[j-1],x[j], y[j])
or x[j] = f(x[j-1],z[j],x[j])
, and it makes the value of x[j]
in general to depend on the previous value of x[j]
. In those cases (x
aliased with y
or z
, or both) the algorithm is not idempotent, and the external loop cannot be eliminated. In the case of only aliasing y
with z
the expression is x[j] = f(x[j-1], y[j])
(or x[j] = f(x[j-1], z[j])
) so no dependency exist on previous values, and the algorithm is idempotent.
So, in conclusion, in case of allowing aliasing between x
vector and any of y
or z
, the outer loop cannot be eliminated, and must be conserved. In case of aliasing of y
and z
the algorithm continues to be idempotent, and the outer loop is not necessary.
The outer loop is just noise, it does the same all the time, as Luis stated in his answer. So we don't have to consider this (or can massivly parallize it, but doing it once would be enough)....
The inner loop depends on the previous cycle(x[i-1]
).
A precomputation of z * y
changing the inner loop to x[i] = preComps[i] - z[i] * x[i - 1];
(as proposed by dvhh) might yield a benefit, when a Multiply-Accumulate instruction is used, but this is implementation-specific and I don't know the gain of this.
If there are 0
s in z[]
then there would be a possibility to parallelize. Because for z[i] = 0
-> x[i] = 0
giving you the possibility to split the inner loop at this point as
x[i+1]
will always be equal y[i+1] * z[i+1]
which is known at any moment. Giving you an entry point for another loop.