In my code, I have the following section (simplified)
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    int x = struct_arr[i].x;
    double y = struct_arr[i].y;
    double z = struct_arr[i].z;
    double w = struct_arr[i].w;
    out[i].x = get_new_x(x, y, z, w);
}
which suffers a drastic slowdown when parallelized. I suspected false sharing, and profiling with valgrind (cachegrind) showed a high cache-miss rate during execution.
I have not provided details on what goes on in get_new_x, since I want to focus on one thing at a time. Is it reasonable to guess that there is some false sharing in the part leading up to the function call? Each thread has its own local variables for x, y, z, and w, but they are all reading from the same array. Could that alone be enough to cause cache misses? Similarly, I suspect there might be a cache-conflict issue when the results of get_new_x are written to out[].
I guess all of these are possible causes of false sharing, but what are some ways of fixing it? And is either operation (reading vs. writing) more or less likely to cause false-sharing issues?