I am new to OpenMP, but I have been puzzled by this for a few days and couldn't find an answer online. Hopefully someone here can explain this strange phenomenon to me.

I wanted to compare runtimes between a sequential and a parallel version of the same program. The parallel version runs much faster than the sequential one (~5x) when I compile them (with gcc-10) with -O or above (though the differences between the individual optimisation levels are quite small).

However, this is not the case when I compile both programs with -O0. In fact, when compiling both versions with -O0, the sequential version is even slightly faster. I tried to understand whether some of the optimisations enabled only at -O1 and above were having a substantial effect, but without luck.

For the record, compiling with -Os gives runtimes better than -O0 but far worse than -O1 and above.

Did anyone notice something similar? Is there an explanation for this?

Thanks!

====

Here are links to the c files: sequential code, parallel code
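
In short, the code has roughly the following shape (a simplified sketch: N and the arithmetic are placeholders, the real expressions are much longer, and the exact OpenMP construct may differ):

#define N 100000000L

long xa = 0, xb = 0; /* ... eight accumulators in total, up to xh */

int main(void) {
    /* Parallel version: eight loops, one per section/thread. The
       sequential version runs the same eight loops one after another. */
    #pragma omp parallel sections num_threads(8)
    {
        #pragma omp section
        for (long i = 0; i < N; i++) xa += i * 3 + 1; /* placeholder expression */
        #pragma omp section
        for (long i = 0; i < N; i++) xb += i * 5 + 2; /* placeholder expression */
        /* ... six more sections for xc .. xh ... */
    }
    return 0;
}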

seiller.t
  • Without any specific code samples or an [MRE](https://stackoverflow.com/help/minimal-reproducible-example), one can only speculate. – Hristo Iliev Nov 06 '20 at 22:24
  • Right, sorry. The code is quite uninteresting. Basically the sequential version is a loop with several chunks of code computing algebraic expressions and storing them in a table. The parallel version consists of eight loops run on eight different threads performing the same computations. – seiller.t Nov 06 '20 at 22:26
  • @HristoIliev I now added links to the files so you can get the code. You will see that the computations are meaningless, but I wanted to have substantial runtime. The parallel code is written to compute on eight threads, and the example is built exactly for that. – seiller.t Nov 06 '20 at 22:38
  • You should include code fragments in your question and not link to other sites as those have the tendency to disappear and/or move. In any case, the runtime of your parallel code is that of the slowest thread, i.e., the one doing the most complex calculation. Since `xa`, `xb`, etc. are all shared, with `-O0` and OpenMP enabled, GCC uses double pointer indirection when reading and storing their values. This is not the case without OpenMP. Examine the assembly code for further clues. – Hristo Iliev Nov 06 '20 at 23:04
  • @HristoIliev I understand the runtime is that of the most complex computation. However, I would not expect it to take more time than the sequential program that performs all of the 8 computations. Moreover, the parallel version with -O1 to -O3 is 5x faster than the sequential one (i.e. compared to -O0, the sequential code runs twice as fast, the parallel one ten times faster). Can this huge difference in efficiency be explained by some specific optimisation? – seiller.t Nov 06 '20 at 23:14
  • See lines 919-924 [here](https://godbolt.org/z/8Tvzsn) and compare with line 306 [here](https://godbolt.org/z/8799x1). And yes, the huge difference is explained with register optimisation and the OpenMP memory model which allows threads to have temporarily divergent views on shared variables. – Hristo Iliev Nov 06 '20 at 23:17

1 Answer

The core of all your loops is something like:

var += something;

In the sequential code, each var is a local stack variable and with -O0 the line compiles to:

; Compute something and place it in RAX
ADD QWORD PTR [RBP-vvv], RAX

Here vvv is the offset of var in the stack frame rooted at the address stored in RBP.

With OpenMP, the compiler outlines the parallel region into a separate function, and the same expression becomes:

*(omp_data->var) = *(omp_data->var) + something;

where omp_data is a pointer to a structure holding pointers to the shared variables used in the parallel region. This compiles to:

; Compute something and store it in RAX
MOV RDX, QWORD PTR [RBP-ooo]  ; Fetch omp_data pointer
MOV RDX, QWORD PTR [RDX]      ; Fetch *(omp_data->var)
ADD RDX, RAX
MOV RAX, QWORD PTR [RBP-ooo]  ; Fetch omp_data pointer
MOV QWORD PTR [RAX], RDX      ; Assign to *(omp_data->var)

This is the first reason the parallel code is slower: the simple act of incrementing var now takes four separate memory accesses instead of a single read-modify-write ADD.

The second, and actually stronger, reason is false sharing. You have 8 shared accumulators: xa, xb, etc. Each is 8 bytes long and aligned in memory, 64 bytes in total. Given how most compilers place such variables, they most likely end up next to each other, in one or two cache lines (a cache line on x86-64 is 64 bytes long and is read and written as a single unit).

When one thread writes to its accumulator, e.g., thread 0 updates xa, the cache line holding it is invalidated in every other core whose accumulator happens to share that line, and those threads have to re-read the value from an upper-level cache or even from main memory. This is bad. It is so bad that the slowdown it causes far outweighs the cost of accessing the accumulators through double pointer indirection.
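
To isolate the effect of false sharing (a sketch, not from the original code; the padded array and the expression are illustrative), give each accumulator its own cache line:

#include <omp.h>

#define N 100000000L

/* Pad each accumulator to a full 64-byte cache line so that a write by
   one thread does not invalidate the line holding the other accumulators. */
struct padded { long v; char pad[64 - sizeof(long)]; };
_Alignas(64) struct padded acc[8];

int main(void) {
    #pragma omp parallel num_threads(8)
    {
        int t = omp_get_thread_num();   /* 0..7, one accumulator per thread */
        for (long i = 0; i < N; i++)
            acc[t].v += i * 3 + t;      /* placeholder computation */
    }
    return 0;
}

Compiled with gcc -O0 -fopenmp, each thread now writes to its own cache line.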

What does -O1 change? It introduces register optimisation:

register r = *(omp_data->var);
for (a = ...) {
   r += something;
}
*(omp_data->var) = r;

Despite var being a shared variable, the OpenMP memory model allows each thread to have a temporarily divergent view of shared memory between flush points. This is what permits the register optimisation: var is kept in a register for the whole duration of the loop, and its memory copy is only updated once, after the loop ends.

The solution is to simply make all xa, xb, etc. private.
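
With the loop-per-thread structure sketched in the question, that can look like the following (again a sketch with placeholder expressions): each thread accumulates into a variable on its own stack and writes the shared result once at the end:

#pragma omp parallel sections num_threads(8)
{
    #pragma omp section
    {
        long local = 0;           /* private: lives on this thread's own stack,
                                     no omp_data indirection, no false sharing */
        for (long i = 0; i < N; i++)
            local += i * 3 + 1;   /* placeholder expression */
        xa = local;               /* a single write to the shared variable */
    }
    /* ... the same pattern for xb .. xh in the other seven sections ... */
}

Even at -O0 the hot loop then touches only the thread's own stack frame, and at -O1 and above the compiler can keep local in a register for the entire loop.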

Hristo Iliev