
I'm trying to replicate the effects of false sharing using OpenMP as explained in the OpenMP introduction by Tim Mattson.

My program performs a straightforward numerical integration (see the link for the mathematical details), and I've implemented two versions. The first is supposed to be cache-friendly: each thread accumulates its portion of the index space in a local variable:

#include <omp.h>     // omp_get_thread_num
#include <vector>

const auto num_slices = 100000000;
const auto num_threads = 4;  // Swept from 1 to 9 threads
const auto slice_thickness = 1.0 / num_slices;
const auto slices_per_thread = num_slices / num_threads;

std::vector<double> partial_sums(num_threads);

#pragma omp parallel num_threads(num_threads)
{
  double local_buffer = 0;
  const auto thread_num = omp_get_thread_num();
  for(auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
    local_buffer += func(slice * slice_thickness); // <-- Updates thread-exclusive buffer
  partial_sums[thread_num] = local_buffer; 
}
// Sum up partial_sums to receive final result
// ...

while the second version has each thread update its element of a shared std::vector<double> directly, so that every write invalidates the copy of that cache line held in the other threads' caches:

// ... as above
#pragma omp parallel num_threads(num_threads)
{
  const auto thread_num = omp_get_thread_num();
  for(auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
    partial_sums[thread_num] += func(slice * slice_thickness); // <-- Invalidates caches
}
// Sum up partial_sums to receive final result
// ...

The problem is that I am unable to see any effects of false sharing whatsoever unless I turn off optimization.

[Plot: mean execution times of both versions for 1 to 9 threads, compiled with -O0 and with -O3]

Compiling my code (which has to account for a few more details than the snippets above) with GCC 8.1 and no optimization (-O0) yields the results I naively expected, while full optimization (-O3) eliminates any performance difference between the two versions, as shown in the plot.

What's the explanation for this? Does the compiler actually eliminate false sharing? If not, how come the effect is so small when running the optimized code?

I'm on a Core i7 machine running Fedora. The plot shows mean values; the sample standard deviations don't add any information to this question.


1 Answer


tl;dr: The compiler optimizes your second version into the first.

Consider the code within the loop of your second implementation - ignoring the OMP/multithreaded aspect of it for a moment.

You're incrementing a value inside an std::vector, whose storage is necessarily located on the heap (well, up to and including C++17, anyway). The compiler sees you're adding to a value on the heap within a loop; this is a typical candidate for optimization: it hoists the heap access out of the loop and uses a register as a buffer. It doesn't even need to keep re-reading the heap, since these are just additions - so it essentially arrives at your first solution.
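
Concretely - as a source-level illustration rather than actual compiler output - the optimizer is free to rewrite the loop body of your second version into something like:

double local_sum = 0;  // accumulator kept in a register
for (auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
  local_sum += func(slice * slice_thickness);
partial_sums[thread_num] += local_sum;  // the heap is read and written once, after the loop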

See this happening on GodBolt (with a simplified example) - notice how the code for bar1() and bar2() is almost the same, with accumulation happening in registers.
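
The compiled output behind the link isn't reproduced here, but a minimal sketch of the kind of comparison it makes might look like the following (the names foo, bar1, bar2 and result are borrowed from the answer and the comments below; the actual code on GodBolt may differ):

// foo stands in for the integrand; it is defined here so the compiler can see
// that it doesn't touch memory.
double foo(double x) { return 4.0 / (1.0 + x * x); }

// bar1: accumulate into a local and write the result once at the end.
void bar1(double* result, long n, double dx) {
  double local = 0.0;
  for (long i = 0; i < n; ++i)
    local += foo(i * dx);
  *result = local;
}

// bar2: accumulate through the pointer on every iteration.
void bar2(double* result, long n, double dx) {
  *result = 0.0;
  for (long i = 0; i < n; ++i)
    *result += foo(i * dx);
}

// Compiled with -O2/-O3, GCC typically keeps bar2's running sum in a register
// as well and stores to *result only once, so the two functions end up with
// nearly identical inner loops.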

Now, the fact that there's multi-threading and OMP involved doesn't change the above. If you were to use, say, std::atomic<double> instead of double, then things might change (and maybe not even then, if the compiler is smart enough).
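
For illustration, a sketch of what that variant could look like - this is not code from the answer; func is the integrand from the question, and since fetch_add on floating-point atomics only arrived in C++20, a compare-exchange loop is used here:

#include <atomic>
#include <omp.h>
#include <vector>

double func(double x);  // the integrand from the question, defined elsewhere

void accumulate_atomic(std::vector<std::atomic<double>>& partial_sums,
                       long slices_per_thread, double slice_thickness)
{
  #pragma omp parallel num_threads(static_cast<int>(partial_sums.size()))
  {
    const auto thread_num = omp_get_thread_num();
    auto& slot = partial_sums[thread_num];
    for (auto slice = slices_per_thread * thread_num;
         slice < slices_per_thread * (thread_num + 1); ++slice)
    {
      const double value = func(slice * slice_thickness);
      double observed = slot.load(std::memory_order_relaxed);
      // compare_exchange_weak refreshes 'observed' on failure, so just retry
      while (!slot.compare_exchange_weak(observed, observed + value,
                                         std::memory_order_relaxed))
        ;
    }
  }
}

In principle a compiler is allowed to combine relaxed atomic operations, but in practice current compilers don't, so the per-iteration read-modify-write - and the associated cache-line traffic - would remain.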


Notes:

  • Thanks goes to @Evg for noticing a glaring mistake in the code of a previous version of this answer.
  • The compiler must be able to know that func() won't also change the value of your vector - or to decide that, for the purposes of addition, it shouldn't really matter.
  • This optimization could be seen as a Strength Reduction - from an operation on the heap to one on a register - but I'm not sure that term is in use for this case.
  • Note that in your example at GodBolt you're modifying the pointer `result` itself, not the value `*result` it points to, and the value returned by `foo` (in `eax`) is not even used! If you correct the code, you'll see that the generated assemblies are not identical at all; the compiler does not perform the optimization you describe. – Evg Oct 14 '18 at 09:41
  • Great answer. I saw this in an answer that came to the same conclusion a few months ago (heap value put in a register), but I'm still a bit surprised by this. I recently gave an answer which explicitly avoided false sharing without using `threadprivate`, but maybe that was not necessary: https://stackoverflow.com/a/52754723/2542702. I wonder how reliable this optimization is? Premature optimization is the root of all evil, but I would have thought this was a necessary explicit optimization; maybe it's now premature (and maybe has been for a long time). – Z boson Oct 15 '18 at 07:21
  • @Evg: You're right; but correcting the code also involved making sure the compiler can realize `int foo()` can't impact `*result`. See new link – einpoklum Oct 15 '18 at 10:00
  • A couple of observations: 1) `__restrict__` seems to have no influence on the code, 2) if you add `__attribute__((noinline))` to `foo` to prevent inlining, the assembly code will be even simpler to grasp. – Evg Oct 15 '18 at 10:35