
Would false sharing happen in the following program?

Memory

  • 1 array divided into 4 equal regions: [A1, A2, B1, B2]
  • The whole array can fit into L1 cache in the actual program.
  • Each region is padded to be a multiple of 64 bytes.

Steps

1. Thread 1 writes to regions A1 and A2 while thread 2 writes to regions B1 and B2.
2. Barrier.
3. Thread 1 reads B1 and writes to A1 while thread 2 reads B2 and writes to A2.
4. Barrier.
5. Go to step 1.

Test

#include <vector>
#include <iostream>
#include <cstdint>  // std::int32_t lives here, not in <stdint.h>
int main() {
    int N = 64;
    std::vector<std::int32_t> x(N, 0);
    #pragma omp parallel
    {
        for (int i = 0; i < 1000; ++i) {
            // Step 1: one thread writes A1+A2 (j == 0), the other writes B1+B2 (j == 1).
            #pragma omp for
            for (int j = 0; j < 2; ++j) {
                for (int k = 0; k < (N / 2); ++k) {
                    x[j*N/2 + k] += 1;
                }
            }
            // The implicit barrier at the end of the omp for is step 2.
            // Step 3: one thread reads B1 and writes A1, the other reads B2 and writes A2.
            #pragma omp for
            for (int j = 0; j < 2; ++j) {
                for (int k = 0; k < (N/4); ++k) {
                    x[j*N/4 + k] += x[N/2 + j*N/4 + k] - 1;
                }
            }
        }
    }
    for (auto i : x) std::cout << i << " ";
    std::cout << "\n";
}
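
(To reproduce this, the test presumably has to be compiled with OpenMP enabled, e.g. `g++ -fopenmp`, and run with `OMP_NUM_THREADS=2` so that the two-thread schedule described in the steps above applies.)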

Result

The first 32 elements (A1 and A2) are 500500 (= 1000 * 1001 / 2); the last 32 elements (B1 and B2) are 1000. This follows because each outer iteration i leaves every B element at i and adds i to every A element (+1 from the first loop, plus B's value minus 1 from the second), so after 1000 iterations the A elements hold 1 + 2 + … + 1000.
  • How do you stop your threads being re-scheduled on different cores? – Richard Critten Oct 02 '18 at 16:12
  • @RichardCritten I don't know. I hope: 1. the `#pragma omp parallel` line creates 2 threads that each stick with one core. 2. the remaining lines just use the two existing OpenMP threads. Can OpenMP do that? I am just thinking about the algorithm now. Update: OpenMP probably can if I set environment variables: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-affinity.html (see the placement sketch after these comments). If it can't, then I will try to learn pthreads. – R zu Oct 02 '18 at 16:13
  • If the entire array can be loaded into a single cache line then you will have issues. To avoid false sharing, each segment you are working on has to be in a different cache line. – NathanOliver Oct 02 '18 at 16:22
  • @NathanOliver: If the entire array barely fits into `n_cores x L1 cache size`, will it be fine? If the CPU cores never write to the same cache line at the same time, would it be fine? – R zu Oct 02 '18 at 16:25
  • It's cache lines you have to worry about. That is typically 64 bytes. You can have different cores working on different cache lines at the same time. – NathanOliver Oct 02 '18 at 16:27
  • But suppose core 1's cache stores cache lines A and B. Now core 1 writes to cache line A and core 2 writes to cache line B. While both cores write to different cache lines at the same time, would the existing content of the cache matter? – R zu Oct 02 '18 at 16:30
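
The thread-placement question above can at least be checked at runtime. A minimal sketch, assuming Linux and GCC (`sched_getcpu` is glibc-specific; `omp_get_proc_bind` needs OpenMP 4.0+); the pinning itself is requested before launch, e.g. with `OMP_PROC_BIND=true OMP_PLACES=cores ./a.out`:

#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc-specific
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // Report where each thread runs and the binding policy in effect.
        #pragma omp critical
        std::printf("thread %d runs on CPU %d (bind policy %d)\n",
                    omp_get_thread_num(), sched_getcpu(),
                    static_cast<int>(omp_get_proc_bind()));
    }
}

Running it with and without `OMP_PROC_BIND` set should show the difference in placement.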

1 Answer


There is some false sharing in your code, since `x` is not guaranteed to be aligned to a cache line; padding each region is not necessarily enough. In your example N is really small, so the biggest overhead is probably worksharing and thread management rather than false sharing. If N is sufficiently large, i.e. array-size / number-of-threads >> cache-line-size, false sharing is not a relevant problem.
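
For illustration only (not part of the original answer): one way to take alignment out of the equation is to give each per-thread region a cache-line-aligned type. A sketch assuming a 64-byte line; C++17's `std::hardware_destructive_interference_size` reports the portable value where supported:

#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kCacheLine = 64;  // assumed line size

// alignas rounds both the alignment and the size of the struct up to
// 64 bytes, so adjacent PaddedCounter objects never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::int32_t value;
};

int main() {
    PaddedCounter counters[2] = {};
    std::printf("sizeof = %zu, addresses: %p %p\n", sizeof(PaddedCounter),
                static_cast<void*>(&counters[0]),
                static_cast<void*>(&counters[1]));
}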

Alternating writes to A2 from different threads in your code are also not optimal in terms of cache usage, but that is not a false-sharing issue.

Note that you do not need to split the loops. If you access memory contiguously in a loop, one loop is just fine, e.g.

#pragma omp for
for (int j = 0; j < N; ++j)
    x[j] += 1;

If you are really careful, you may add `schedule(static)`; then you have a guarantee of an even, contiguous work distribution.
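
Concretely, a sketch of the same loop with the explicit clause; the split in the comment is what `schedule(static)` with no chunk size gives for two threads:

#pragma omp for schedule(static)
for (int j = 0; j < N; ++j)
    x[j] += 1;  // 2 threads: thread 0 gets j in [0, N/2), thread 1 gets [N/2, N)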

Remember that false sharing is a performance issue, not a correctness problem, and it is only relevant if it occurs frequently. A typical bad pattern is repeated writes to `vector[my_thread_index]`.
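
To make that pattern concrete, a hypothetical sketch (the names are invented here): per-thread counters packed into adjacent 4-byte slots all land on one 64-byte line, so every increment invalidates the other cores' copies:

#include <omp.h>
#include <vector>
#include <cstdint>

int main() {
    // One 4-byte counter per thread: with <= 16 threads they all fit in a
    // single 64-byte cache line, which is the classic false-sharing layout.
    std::vector<std::int32_t> hits(omp_get_max_threads(), 0);
    #pragma omp parallel
    {
        const int my_thread_index = omp_get_thread_num();
        for (int i = 0; i < 1000000; ++i)
            ++hits[my_thread_index];  // every write contends for the same line
    }
}

Padding each counter to a full cache line (as in the alignment sketch above) removes the contention.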

Zulan
  • Thanks a lot for the answer. I am using the Eigen library (a SIMD linear algebra library), so the memory alignment should be fine. Does thread management mean the barrier in this case? – R zu Oct 02 '18 at 16:35
  • Thread management would be the barrier and anything that's necessary to split work (i.e. loop iterations) among participating threads. – Zulan Oct 03 '18 at 08:02
  • You're also likely to have performance issues due to aggressive HW prefetching. I'd consider this as false-sharing as well (or some extension of the same concept), although it's not exactly the textbook definition. – Leeor Oct 17 '18 at 13:26