In the code below I have parallelised the outer loop using OpenMP's standard parallel for directive.
#pragma omp parallel for private(i, j, k, d_equ) shared(cells, tmp_cells, params)
for (i = 0; i < some_large_value; i++)
{
    for (j = 0; j < some_large_value; j++)
    {
        ....
        // Some operations performed here using only private variables
        ....
        // Accessing this shared array is causing False Sharing
        for (k = 0; k < 10; k++)
        {
            cells[i * width + j].speeds[k] = some_calculation(i, j, k, cells);
        }
    }
}
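For reference, assume cells has a layout along these lines (I haven't shown the real struct, so this is only a sketch, but the sizes are what matter):

// Assumed cell layout -- the actual definition isn't shown above.
// Ten double speeds are 80 bytes per cell, so with 64-byte cache
// lines, consecutive cells in the array straddle shared lines.
typedef struct
{
    double speeds[10];
} t_cell;

t_cell *cells;   /* flattened 2-D grid, indexed as cells[i * width + j] */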
This has given me a significant runtime improvement (~140s to ~40s), but one area I have noticed still really lags behind: the innermost loop marked above.
I know for certain that the array above is causing False Sharing, because if I make the change below I see another huge leap in performance (~40s to ~13s).
for (k = 0; k < 10; k++)
{
    /* same computation, but the result goes into a private
     * variable instead of the shared cells array */
    double somevalue = some_calculation(i, j, k, cells);
}
In other words, as soon as I redirected the write from the shared array to a private variable, I saw a huge speedup.
Is there any way I can improve my runtime by avoiding False Sharing in the scenario I have just described? I cannot find many resources online that actually help with this problem, even though the problem itself is mentioned a lot.
One idea I had was to create an overly large array (10x what is needed) so that enough margin space is kept between elements that, when one is pulled into a cache line, no other thread's data shares that line. However, this failed to produce the desired effect.
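To be concrete, what I was aiming for is something like the sketch below: pad each cell out to a whole number of cache lines and align the allocation to match. This assumes 64-byte cache lines and double speeds, and t_cell_padded / alloc_padded_cells are made-up names:

#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache-line size */

/* Pad each cell so it occupies exactly two full cache lines:
 * ten double speeds are 80 bytes, so 48 bytes of padding brings
 * the struct to 128 bytes. If the array starts on a 128-byte
 * boundary, no two cells can ever share a cache line. */
typedef struct
{
    double speeds[10];                                  /* 80 bytes of payload */
    char   pad[2 * CACHE_LINE - 10 * sizeof(double)];   /* 48 bytes of padding */
} t_cell_padded;

static t_cell_padded *alloc_padded_cells(size_t ncells)
{
    /* aligned_alloc (C11) requires the size to be a multiple of the
     * alignment; sizeof(t_cell_padded) is exactly 2 * CACHE_LINE. */
    return aligned_alloc(2 * CACHE_LINE, ncells * sizeof(t_cell_padded));
}

The exact pad size and alignment here are guesses; the point is just that each cell ends up in cache lines of its own, which is the "margin space" I described above.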
Is there any easy (or even hard, if need be) way of reducing or removing the False Sharing in that loop?
Any insight or help would be greatly appreciated!
EDIT: Assume some_calculation() does the following:
(tmp_cells[ii * params.nx + jj].speeds[kk]
 + params.omega * (d_equ[kk] - tmp_cells[ii * params.nx + jj].speeds[kk]));
I cannot move this calculation out of my for loop because I rely on d_equ, which is recalculated for each iteration.
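To make that dependency concrete, the inner section looks roughly like this (compute_d_equ() is a stand-in name for however d_equ actually gets filled in, and width presumably equals params.nx):

/* inside the j-loop, once per cell: fill the (private) d_equ array */
compute_d_equ(d_equ, i, j, tmp_cells, params);   /* hypothetical helper */

/* relaxation step: d_equ is recomputed for every (i, j), so this
 * k-loop cannot be hoisted out of the cell loops */
for (k = 0; k < 10; k++)
{
    cells[i * width + j].speeds[k] =
        tmp_cells[i * params.nx + j].speeds[k]
        + params.omega * (d_equ[k] - tmp_cells[i * params.nx + j].speeds[k]);
}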