----- example code -----------
for (body1 = 0; body1 < NBODIES; body1++) {
    for (body2 = 0; body2 < NBODIES; body2++) {
        OUT[body1] += compute(body1, body2);
    }
}
----- blocking code -----------
for (body2 = 0; body2 < NBODIES; body2 += BLOCK) {
    for (body1 = 0; body1 < NBODIES; body1++) {
        for (body22 = 0; body22 < BLOCK; body22++) {
            OUT[body1] += compute(body1, body2 + body22);
        }
    }
}
I inserted OpenACC directives to offload this code to the GPU, but the performance decreased. I searched some papers, and they conclude that the reason is that OpenACC cannot take advantage of the GPU's shared memory. However, I think the main reason is that the tiling/blocking prevents parallelization, because the tiling introduces data dependences.

Does OpenACC not support, or not encourage, loop tiling? Is there a solution, or an example, where the tiling technique improves OpenACC code?
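
For reference, here is a minimal, self-contained sketch of roughly how I placed the directives on the blocked version. The POS array, the body of compute(), and the data clauses are just stand-ins for my real code, and compute() takes the position array explicitly here so it compiles as a device routine; my actual clauses may differ.

----- OpenACC version (sketch) -----------
#include <stdio.h>

#define NBODIES 1024
#define BLOCK   64

static float POS[NBODIES];   /* placeholder input data */
static float OUT[NBODIES];

/* stand-in for the real pairwise interaction */
#pragma acc routine seq
static float compute(const float *pos, int body1, int body2)
{
    return pos[body1] * pos[body2];
}

int main(void)
{
    for (int i = 0; i < NBODIES; i++) { POS[i] = (float)i; OUT[i] = 0.0f; }

    /* keep POS and OUT on the device across the whole blocked computation */
    #pragma acc data copyin(POS) copy(OUT)
    for (int body2 = 0; body2 < NBODIES; body2 += BLOCK) {
        /* only body1 is parallelized; the body2 block loop stays sequential
           on the host, launching one kernel per block */
        #pragma acc parallel loop
        for (int body1 = 0; body1 < NBODIES; body1++) {
            for (int body22 = 0; body22 < BLOCK; body22++) {
                OUT[body1] += compute(POS, body1, body2 + body22);
            }
        }
    }

    printf("OUT[0] = %f\n", OUT[0]);
    return 0;
}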