
In the code below I have parallelised the outer loop using OpenMP's standard parallel for directive.

#pragma omp parallel for private(i, j, k, d_equ) shared(cells, tmp_cells, params)
for(i=0; i<some_large_value; i++)
{
   for(j=0; j<some_large_value; j++)
   {
       ....
       // Some operations performed over here which are using private variables
       ....

       // Accessing a shared array is causing False Sharing
       for(k=0; k<10; k++)
       {
          cells[i * width + j].speeds[k] = some_calculation(i, j, k, cells);
       }
   }
}

This has given me a significant improvement to runtime (~140s to ~40s) but there is still one area I have noticed really lags behind - the innermost loop I marked above.

I know for certain the array above is causing False Sharing because if I make the change below, I see another huge leap in performance (~40s to ~13s).

 for(k=0; k<10; k++)
 {
     double somevalue = some_calculation(i, j);
 }

In other words, as soon as I changed the write target from the shared array to a private variable, there was a huge speedup.

Is there any way I can improve my runtime by avoiding False Sharing in the scenario I have just explained? I cannot seem to find many resources online that seem to help with this problem even though the problem itself is mentioned a lot.

I had an idea to create an overly large array (10x what is needed) so that enough margin space is kept between each element to make sure when it enters the cache line, no other thread will pick it up. However this failed to create the desired effect.
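For reference, padding of this kind is usually expressed by aligning each element to the cache line size rather than by enlarging the array by a fixed factor. The following is only a sketch under stated assumptions: a 64-byte cache line, a C11 compiler (`_Alignas`, `aligned_alloc`), and a hypothetical `padded_cell_t` type that stands in for the real cell struct, which is not shown here. The helper names are also made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumption: typical x86-64 cache line size */
#define NSPEEDS    10   /* matches the k < 10 loop */

/* Hypothetical cell type. Aligning the member to CACHE_LINE forces
 * sizeof(padded_cell_t) up to a multiple of CACHE_LINE, so two cells
 * never occupy the same cache line. */
typedef struct {
    _Alignas(CACHE_LINE) double speeds[NSPEEDS];
} padded_cell_t;

_Static_assert(sizeof(padded_cell_t) % CACHE_LINE == 0,
               "each cell must fill whole cache lines");

/* Returns 1 if p starts on a cache-line boundary, 0 otherwise. */
int is_line_aligned(const void *p)
{
    return (uintptr_t)p % CACHE_LINE == 0;
}

/* Allocates n cells starting on a cache-line boundary.
 * The caller releases the memory with free(). */
padded_cell_t *alloc_padded_cells(size_t n)
{
    assert(n > 0);
    return aligned_alloc(CACHE_LINE, n * sizeof(padded_cell_t));
}
```

With these assumptions each cell grows from 80 to 128 bytes, so the layout trades memory and bandwidth for isolation. It only helps if false sharing really is the bottleneck.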

Is there any easy (or even hard if needs be) way of reducing or removing the False Sharing found in that loop?

Any form of insight or help will be greatly appreciated!

EDIT: Assume some_calculation() does the following:

 (tmp_cells[ii*params.nx + jj].speeds[kk] + params.omega * (d_equ[kk] - tmp_cells[ii*params.nx + jj].speeds[kk]));

I cannot move this calculation out of my for loop because I rely on d_equ which is calculated for each iteration.

Michael Aquilina

1 Answer


Before answering your question, I have to ask: is it really a false sharing situation when you pass the whole cells array as input to some_calculation()? It seems you are actually sharing the whole array. You may want to provide more info about this function.

If yes, go on with the following.

You've already shown that the private variable double somevalue improves performance. Why not just use this approach?

Instead of a single double variable, you could define a private array private_speed[10] just before the for k loop, fill it inside the loop, and copy it back to cells after the loop with something like

 memcpy(cells[i*width+j].speeds, private_speed, sizeof(...));
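
Put together, the private-array approach might look roughly like the sketch below. Several things are assumptions, because the question does not show them: the `cell_t` layout, the function name `relax_private`, the `height`/`width`/`omega` parameters (standing in for `params`), and above all `compute_d_equ`, which is only a placeholder for the real per-cell `d_equ` calculation. The update formula itself is the one from the question's EDIT.

```c
#include <assert.h>
#include <string.h>

#define NSPEEDS 10   /* matches the k < 10 loop in the question */

/* Hypothetical layout: the question only shows a .speeds[k] member. */
typedef struct {
    double speeds[NSPEEDS];
} cell_t;

/* Placeholder: the question computes d_equ per cell but does not show how. */
static void compute_d_equ(double d_equ[NSPEEDS])
{
    for (int k = 0; k < NSPEEDS; k++)
        d_equ[k] = 1.0;
}

void relax_private(cell_t *cells, const cell_t *tmp_cells,
                   int height, int width, double omega)
{
    assert(height > 0 && width > 0);

    #pragma omp parallel for
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            /* Declared inside the loop body, so each thread has its own copy. */
            double d_equ[NSPEEDS];
            double private_speed[NSPEEDS];
            const cell_t *src = &tmp_cells[i * width + j];

            compute_d_equ(d_equ);

            /* All intermediate writes go to the private buffer. */
            for (int k = 0; k < NSPEEDS; k++)
                private_speed[k] = src->speeds[k]
                                 + omega * (d_equ[k] - src->speeds[k]);

            /* One bulk write to the shared array per cell. */
            memcpy(cells[i * width + j].speeds, private_speed,
                   sizeof private_speed);
        }
    }
}
```

The shared array is still written once per cell, so this reduces the number of separate stores to shared memory rather than removing writes to shared cache lines altogether.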
kangshiyin
  • How exactly would I copy a private variable after the loop? I will update my code to show what the exact calculation was. I was using the `some_calculation` function as a placeholder to hide un-needed implementation details. – Michael Aquilina Oct 13 '13 at 19:01
  • Something like `memcpy(cells[i*width+j].speeds, private_speed, sizeof(...))` should be fine. – kangshiyin Oct 13 '13 at 19:06
  • I just tried your memcpy idea, but it hasn't changed the speed at all. While in theory it should be the case that less False Sharing should occur, the cell array itself is still updated which causes other cores to invalidate any cached copies they have. – Michael Aquilina Oct 13 '13 at 19:13
  • Then you could use a larger private variable defined outside the for j loop, holding one row of the shared array cells, to get less false sharing. – kangshiyin Oct 13 '13 at 19:20
  • I can't understand how this can be a bug? The problem with assigning a private array is that there is no way of telling which portion of the target array you should copy to. – Michael Aquilina Oct 13 '13 at 20:44
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/39154/discussion-between-michael-aquilina-and-eric) – Michael Aquilina Oct 13 '13 at 21:21
  • Why not? The portion is identified by the other private variables i and j. – kangshiyin Oct 14 '02:31
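
The row-buffer suggestion from the comments could be sketched as follows: each thread owns a buffer the size of one row, fills it privately, and copies it into `cells` at the offset given by `i`. As before, `cell_t`, `relax_rows`, the `height`/`width`/`omega` parameters and the `compute_d_equ` placeholder are assumptions made for illustration, since the original code does not show them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NSPEEDS 10   /* matches the k < 10 loop in the question */

/* Hypothetical layout: the question only shows a .speeds[k] member. */
typedef struct {
    double speeds[NSPEEDS];
} cell_t;

/* Placeholder: the question computes d_equ per cell but does not show how. */
static void compute_d_equ(double d_equ[NSPEEDS])
{
    for (int k = 0; k < NSPEEDS; k++)
        d_equ[k] = 1.0;
}

/* Returns 0 on success, -1 if any thread could not allocate its row buffer. */
int relax_rows(cell_t *cells, const cell_t *tmp_cells,
               int height, int width, double omega)
{
    assert(height > 0 && width > 0);
    int failed = 0;

    #pragma omp parallel
    {
        /* One private row buffer per thread, reused for every row it handles. */
        cell_t *row = malloc((size_t)width * sizeof *row);
        if (row == NULL) {
            #pragma omp atomic write
            failed = 1;
        }

        #pragma omp for
        for (int i = 0; i < height; i++) {
            if (row == NULL)
                continue;   /* this thread has no buffer; skip its rows */

            for (int j = 0; j < width; j++) {
                double d_equ[NSPEEDS];
                const cell_t *src = &tmp_cells[(size_t)i * width + j];

                compute_d_equ(d_equ);
                for (int k = 0; k < NSPEEDS; k++)
                    row[j].speeds[k] = src->speeds[k]
                                     + omega * (d_equ[k] - src->speeds[k]);
            }

            /* i identifies the target: row i starts at cells[i * width]. */
            memcpy(&cells[(size_t)i * width], row, (size_t)width * sizeof *row);
        }

        free(row);
    }
    return failed ? -1 : 0;
}
```

Two threads can then touch the same cache line only where one row ends and the next begins, and only if adjacent rows belong to different threads. If the run time does not change, the bottleneck is probably not false sharing.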