
I have read all the relevant questions I could find, but I still have not found a solution to my issue: I have a function with a double for loop that is the bottleneck of my program.

The code is designed around MPI (a rough sketch follows the list):

  1. Having a big matrix which I scatter among p processes.
  2. Every process now has a submatrix.
  3. Every process is calling update() in a loop.
  4. When the loop is terminated, master process gathers results.
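
For context, here is a rough, untested sketch of that structure (the grid size, iteration count and the simple row-wise MPI_Scatter/MPI_Gather are placeholders; my real code uses a pSqrt x pSqrt block decomposition and is not shown in full):

#include <mpi.h>
#include <stdlib.h>

void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int id, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int N = 2048;        /* full grid dimension (placeholder)   */
    const int rows = N / p;    /* assumes N is divisible by p         */
    float *fullGrid = NULL;
    float *subGrid  = malloc((size_t)rows * N * sizeof *subGrid);

    /* 1. the master holds the big matrix and scatters it among the p processes */
    if (id == 0)
        fullGrid = malloc((size_t)N * N * sizeof *fullGrid);
    MPI_Scatter(fullGrid, rows * N, MPI_FLOAT,
                subGrid,  rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* 2./3. every process now has a submatrix and calls update() in a loop
       (buffer swapping and any boundary exchange between neighbours omitted) */
    for (int step = 0; step < 1000; ++step) {
        /* update(pSqrt, id, subN, gridPtr, gridNextPtr); */
    }

    /* 4. when the loop terminates, the master gathers the results */
    MPI_Gather(subGrid, rows * N, MPI_FLOAT,
               fullGrid, rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(subGrid);
    free(fullGrid);
    MPI_Finalize();
    return 0;
}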

Now I would like to augment my MPI code with OpenMP to get faster execution, by parallelizing the double for loop of update():

void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
        int i = 1, j = 1, end_i = subN - 1, end_j = subN - 1;
        if ( id / pSqrt == 0) {
                i = 2;
                end_i = subN - 1;
        } else if ( id / pSqrt == (pSqrt - 1) ) {
                i = 1;
                end_i = subN - 2;
        }
        #pragma omp parallel for
        for ( ; i < end_i; ++i) {
                if (id % pSqrt == 0) {
                        j = 2;
                        end_j = subN - 1; 
                } else if ((id + 1) % pSqrt == 0) {
                        j = 1;
                        end_j = subN - 2;
                }
                #pragma omp parallel for
                for ( ; j < end_j; ++j) {
                        gridNextPtr[i][j] = gridPtr[i][j]  +
                          parms.cx * (gridPtr[i+1][j] +
                          gridPtr[i-1][j] -
                          2.0 * gridPtr[i][j]) +
                          parms.cy * (gridPtr[i][j+1] +
                          gridPtr[i][j-1] -
                          2.0 * gridPtr[i][j]);
                }
        }
}

I am running this on 2 computers, each with 2 CPUs, using 4 MPI processes. However, I see no speedup: the execution time is the same with and without OpenMP. Any ideas, please? I am compiling with the -O1 optimization flag.

gsamaras
  • @HighPerformanceMark no it is not. I had it just before the 1st for loop, but after my research on related questions I decided to move it there. However, I did not see any change on time execution. – gsamaras Jan 09 '16 at 18:34
  • OK then @HighPerformanceMark I may have interpreted the answers I read wrongly, so I moved the `#pragma` above the for loop. However, it still gives me the same time with and without it. – gsamaras Jan 09 '16 at 18:47
  • I updated my question @HighPerformanceMark. Is it better now? – gsamaras Jan 09 '16 at 19:15
  • I am not sure what you are saying @HighPerformanceMark. Discarding everything that has to do with MPI? But I want to state that the MPI+OpenMP version is not faster (as I would have expected) than the pure MPI version? – gsamaras Jan 09 '16 at 19:24
  • For a small number of cores it is quite normal that hybrid parallel programs are not faster than pure MPI. You should look for a difference in scaling at a large number of nodes, where pure MPI already has problems. – Vladimir F Героям слава Jan 09 '16 at 19:37
  • @VladimirF you are talking about the fact that `p = 4`? – gsamaras Jan 09 '16 at 19:53
  • Yes, you have 4 cores and 4 processes. Why would OpenMP speed it up? Even for 2 processes each with 2 threads I wouldn't expect a large benefit. – Vladimir F Героям слава Jan 09 '16 at 20:45
  • I see your point @VladimirF, but I have no experience on the topic. I can utilize 10 computers, 2 CPUs each. How would you run it? – gsamaras Jan 09 '16 at 20:47
  • I would use pure MPI. Think about introducing OpenMP when you have thousands of cores or perhaps slow interconnect between computers. Read https://www.nersc.gov/users/computational-systems/retired-systems/hopper/performance-and-optimization/using-openmp-effectively-on-hopper/ and look at figure https://www.nersc.gov/assets/Documentation/Performance/_resampled/resizedimage600374-PMAMRscale.png there. For a good MPI code they have benefit from OpenMP only with more than 100 nodes (on a good supercomputer). – Vladimir F Героям слава Jan 09 '16 at 21:11
  • @VladimirF it is for a project, so I *must* use OpenMP. :/ – gsamaras Jan 09 '16 at 21:11
  • In that case use only 2 MPI processes and 2 OpenMP threads in each process. – Vladimir F Героям слава Jan 09 '16 at 21:13
  • So how many computers should I utilize in the machines file - 2, @VladimirF? In the machines file I can say how many processes every computer should create, and then I can do `mpiexec -n 4 -f machines..` for example. I am not sure how to use only 2 OpenMP threads. – gsamaras Jan 09 '16 at 21:20
  • That differs between MPI libraries; you must study the manual. See http://stackoverflow.com/questions/33696673/executing-hybrid-openmp-mpi-jobs-in-mpich for an example. Post a separate question about that particular problem if you have trouble. – Vladimir F Героям слава Jan 09 '16 at 21:23
  • Don't use nested parallel unless you have to. Use collapse instead. – Jeff Hammond Jan 10 '16 at 17:40
  • @Jeff I do not have a clue on what you are saying. Maybe post an answer with an example? In the meanwhile I am going to do a research on that. – gsamaras Jan 10 '16 at 17:42

2 Answers


What about this version (not tested)?

Please compile it and test it. If it works better, I'll explain it more. BTW, using some more aggressive compiler options might help too.

void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    #pragma omp parallel for schedule(static)
    for ( int i = beg_i; i < end_i; ++i ) {
        #pragma omp simd
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j] +
                parms.cx * (gridPtr[i+1][j] +
                        gridPtr[i-1][j] -
                        2.0 * gridPtr[i][j]) +
                parms.cy * (gridPtr[i][j+1] +
                        gridPtr[i][j-1] -
                        2.0 * gridPtr[i][j]);
        }
    }
}

EDIT: Some explanations on what I did to the code...

  1. The initial version was using nested parallelism for no reason at all (a parallel region nested within another parallel region). This was likely to be very counterproductive, so I simply removed it.
  2. The loop indexes i and j were declared and initialised outside of the for loop statements. This is error-prone at two levels: 1/ it may force you to declare their parallel scope (private) explicitly, whereas declaring them within the for statement automatically gives them the right one; and 2/ you can get mix-ups by erroneously reusing the indexes outside of the loops. Moving them into the for statements was easy.
  3. You were changing the boundaries of the j loop inside the parallel region for no good reason. You would have had to declare end_j private. Moreover, it was a potential limitation for further developments (such as the potential use of the collapse(2) directive), as it broke the rules for the canonical loop form defined in the OpenMP standard. So defining beg_i and beg_j outside of the parallel region made sense, sparing computation and simplifying the form of the loops, keeping them canonical.

Now from there, the code was suitable for vectorisation, and the addition of a simple simd directive on the j loop would enforce it, should the compiler fail to see the possible vectorisation by itself.

Gilles
  • Thanks for the answer. I compile like: `mpicc -O1 -o myheat myheat.c -lm` – gsamaras Jan 09 '16 at 19:19
  • what compiler do you use? – Gilles Jan 09 '16 at 19:20
  • I used `-std=c99` flag and it's now OK. I see no speedup again! – gsamaras Jan 09 '16 at 19:25
  • did you try with the (erroneous) `schedule(static,1)`? If so, please try without. And try `-O3 -mtune=native -march=native` (assuming you used GCC) – Gilles Jan 09 '16 at 19:26
  • Thanks Gilles, can you please explain? – gsamaras Jan 18 '16 at 13:11
  • @Gilles (1) reduces the available parallelism. This is bad if you have more threads than the range of `i`, which I find happens routinely on Intel Xeon Phi processors, which support 200+ threads. (2) is superfluous because `The loop iteration variable(s) in the associated for-loop(s) of a for or parallel for construct is (are) private.` This text is taken verbatim from the OpenMP standard. – Jeff Hammond Jan 18 '16 at 21:51

High-level analysis

It is a common fallacy that hybrid programming (e.g. MPI+OpenMP) is a good idea. This fallacy is widely espoused by so-called HPC experts, many of whom are paper pushers at supercomputing centers and do not write much code. An expert take-down of the MPI+Threads fallacy is Exascale Computing without Threads.

This is not to say that flat MPI is the best model. For example, MPI experts espouse a two-level MPI-only approach in Bi-modal MPI and MPI+threads Computing on Scalable Multicore Systems and MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory (free version). In the so-called MPI+MPI model, the programmer exploits shared-memory coherence domains using MPI shared memory instead of OpenMP, but with a private-by-default data model, which reduces the incidence of race conditions. Furthermore, MPI+MPI uses only one runtime system, which makes resource management and process topology/affinity easier. In contrast, MPI+OpenMP requires one either to use the fundamentally non-scalable fork-join execution model with threads (i.e. make MPI calls only between OpenMP parallel regions) or to enable MPI_THREAD_MULTIPLE in order to make MPI calls within threaded regions; MPI_THREAD_MULTIPLE entails noticeable overhead on today's platforms.
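
For concreteness, here is a minimal sketch of how the thread-support level is requested at initialization (only the initialization is shown; which level you need depends on where your MPI calls sit relative to the OpenMP parallel regions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Fork-join style - MPI calls only between OpenMP parallel regions, or
       only from the master thread - needs no more than MPI_THREAD_FUNNELED.
       MPI calls made concurrently from inside parallel regions require
       MPI_THREAD_MULTIPLE, at a cost in overhead. */
    int required = MPI_THREAD_FUNNELED, provided;
    MPI_Init_thread(&argc, &argv, required, &provided);

    if (provided < required) {
        fprintf(stderr, "MPI library does not provide the requested thread level\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... MPI + OpenMP work goes here ... */

    MPI_Finalize();
    return 0;
}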

This topic could cover many pages, which I do not have time to write at the moment, so please see the cited links.

Reducing OpenMP runtime overheads

One reason that MPI+OpenMP does not perform as well as pure MPI is that OpenMP runtime overheads have a tendency to appear in too many places. One type of unnecessary runtime overhead comes from nested parallelism, which occurs when one omp parallel construct is nested inside another. Most programmers do not know that a parallel region is a relatively expensive construct and that one should try to minimize them. Furthermore, omp parallel for is the fusion of two constructs - parallel and for - and one should really try to think of these independently. Ideally, you create one parallel region that contains many worksharing constructs, such as for, sections, etc.
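
As a generic illustration (the arrays and loops below are hypothetical, not taken from your code), the first variant pays the parallel-region overhead twice, while the second creates a single parallel region containing two worksharing constructs:

/* Variant A: two parallel regions, one per "omp parallel for". */
void two_regions(int n, float *a, float *b, float s)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] *= s;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        b[i] += a[i];
}

/* Variant B: one parallel region containing two worksharing "for"
   constructs. The implicit barrier at the end of the first "omp for"
   preserves the ordering between the two loops. */
void one_region(int n, float *a, float *b, float s)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] *= s;

        #pragma omp for
        for (int i = 0; i < n; ++i)
            b[i] += a[i];
    }
}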

Below is your code modified to use only one parallel region and parallelism across both for loops. Because collapse requires perfect nesting (nothing between the two for loops), I had to move some code inside. However, nothing stops the compiler from hoisting this loop invariant back out after the OpenMP lowering (this is a compiler concept, which you can ignore), so that code might still only execute end_i times, rather than end_i*end_j times.

Update: I've adapted the code from the other answer to demonstrate collapse instead.

There are a variety of ways to parallelize these two loops with OpenMP. Below you can see four versions, all of which are compliant with OpenMP 4. Version 1 is likely to be the best, at least with current compilers. Version 2 uses collapse but not simd (it is compliant with OpenMP 3). Version 3 could be the best, but it is harder to implement ideally and does not lead to SIMD code generation with some compilers. Version 4 parallelizes only the outer loop.

You should experiment to see which one of these options is the fastest for your application.

#if VERSION==1
#define OUTER _Pragma("omp parallel for")
#define INNER _Pragma("omp simd")
#elif VERSION==2
#define OUTER _Pragma("omp parallel for collapse(2)")
#define INNER
#elif VERSION==3
#define OUTER _Pragma("omp parallel for simd collapse(2)")
#define INNER
#elif VERSION==4
#define OUTER _Pragma("omp parallel for simd")
#define INNER
#else
#error Define VERSION
#define OUTER
#define INNER
#endif


struct {
    float cx;
    float cy;
} parms;

void update (int pSqrt, int id, int subN, const float * restrict gridPtr[restrict], float * restrict gridNextPtr[restrict])
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    OUTER
    for ( int i = beg_i; i < end_i; ++i ) {
        INNER
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j] + parms.cx * (gridPtr[i+1][j] + gridPtr[i-1][j] - 2 * gridPtr[i][j])
                                              + parms.cy * (gridPtr[i][j+1] + gridPtr[i][j-1] - 2 * gridPtr[i][j]);
        }
    }
}

The example code above is correct with the following compilers:

  • GCC 5.3.0
  • Clang-OpenMP 3.5.0
  • Cray C 8.4.2
  • Intel 16.0.1.

It will not compile with PGI 11.7, both because of [restrict] (replacing it with [] is sufficient) and because of the OpenMP simd clause. This compiler lacks full support for C99, contrary to this presentation. Given that this version was released in 2011, its lack of support for the OpenMP 4 simd construct is not too surprising. Unfortunately, I do not have access to a newer version.
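
For reference, this is the kind of declaration I mean (restrict kept on the row pointers, the [restrict] array qualifier dropped; the plain float ** fallback mentioned in the comments below also works, at the cost of aliasing information for the optimizer):

/* PGI-friendly variant of the interface: no [restrict] array qualifier,
   but the restrict-qualified row pointers still give the compiler most of
   the aliasing information it needs for vectorization. */
void update (int pSqrt, int id, int subN,
             const float * restrict gridPtr[],
             float * restrict gridNextPtr[]);
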

Jeff Hammond
  • Thanks for the great answer Jeff. However I am executing this code with 2 processes and it seems to not be finishing. Any idea? – gsamaras Jan 11 '16 at 23:31
  • It's possible that the code needs `private(i,j,end_i,end_j)` or something like that. One of the pitfalls of not providing a [MCVE](http://stackoverflow.com/help/mcve) is that third-parties cannot V(erify) it or any modifications thereto. – Jeff Hammond Jan 11 '16 at 23:34
  • I know Jeff, but the code is big. I placed a `printf()` right next to the second for loop and I got `i = 1, j = 3, end_i = 2049, end_j = 2049 i = 1, j = 3, end_i = 2049, end_j = 2049`, and i seems to get values in [1, 3] and loop over and over. Where should I place `private()`? Just next to `collapse(2)`? – gsamaras Jan 11 '16 at 23:41
  • Yeah. For what it's worth, the [LLNL tutorial](https://computing.llnl.gov/tutorials/openMP/#DO) is excellent. That is my default reference for OpenMP syntax. – Jeff Hammond Jan 11 '16 at 23:42
  • Feel free to put the code on Github rather than trying to inline here. – Jeff Hammond Jan 11 '16 at 23:46
  • I tried with `#pragma omp parallel for collapse(2) private(i,j,end_i,end_j)`, but the pattern is repeated. I will provide a MCVE now in my question. I will think about git in future questions. – gsamaras Jan 11 '16 at 23:49
  • @Jeff, could you help explain why the model is not scalable in `MPI+OpenMP requires one either to use the fundamentally non-scalable fork-join execution model`? As I understand it, we can also fork lots of threads/processes on the local machine based on the # of cores. BTW, yes, MPI + multithreads may have potential problems in some cases, but it is still very useful and widely used in industry and research software. Great answer +1. – Patric Jan 12 '16 at 06:13
  • Is it possible that the problem is that I use `MPI_Init()` instead of `MPI_Init_thread()`? – gsamaras Jan 12 '16 at 12:25
  • @gsamaras While you are required to use `MPI_Init_thread` with the appropriate argument any time you use threads with MPI, the only case where I know that it breaks if you don't is `MPI_THREAD_MULTIPLE`. The common implementations do not change when you use the other thread modes, because they do not require mutual exclusion internally. – Jeff Hammond Jan 12 '16 at 13:46
  • @Patric It is unscalable because of Amdahl's Law. If you fork-join for compute, and do MPI from the master thread only, your compute time may decrease as the number of threads increases, while the communication time will not. In the limit of many threads, you will spend all your time in MPI. And this assumes you have perfectly scaling compute. – Jeff Hammond Jan 12 '16 at 13:48
  • I tried with `MPI_Init_thread(&argc,&argv, MPI_THREAD_SINGLE, &provided );` but I can see no speedup (with the simple OpenMP code in my question). – gsamaras Jan 12 '16 at 23:08
  • @Jeff, thanks for your answers. I think it still depends on the case, such as the ratio of computation/communication/management costs, and what kind of (weak or strong) scalability we care about. As for Amdahl's Law, only a perfectly parallel program can be scalable; any sequential part will limit the speedup once we scale out a lot. – Patric Jan 13 '16 at 03:12
  • @gsamaras `MPI_Init_thread` should not make your code faster. It is necessary to make your code correct if you use MPI and threads in the same program. – Jeff Hammond Jan 18 '16 at 21:46
  • @Jeff I could not make it finish (your code), so please check the edit history of my question, update your answer with working code and then I will check. – gsamaras Jan 19 '16 at 11:47
  • The other answer has better code. Is it working? Use that instead. – Jeff Hammond Jan 19 '16 at 12:17
  • That's what I am doing @Jeff, but it would be *really* nice to see how your answer actually works. I will keep my upvote, since **in theory** it's nice. Thanks. – gsamaras Jan 21 '16 at 23:22
  • @gsamaras Okay, I took the code from the other answer and show how it works with `collapse(2)`. The Intel compiler does not SIMD-ize the code this way though, for reasons I don't understand. Perhaps it is just a limitation in the current implementation, wherein the `collapse` and `simd` clauses don't play nice together. – Jeff Hammond Jan 21 '16 at 23:38
  • C99 restrict keyword is well documented online already. Did you try to look it up? – Jeff Hammond Jan 22 '16 at 01:24
  • I did try to compile, but my compiler, despite having the c99 flag enabled, does not seem happy about it. I replaced it with `float**`. From some experiments it seems that it cannot beat the other answer, but it's not much slower either. However, I am worried about the validity of this approach, so I am going to stop here. Thanks again for the help. – gsamaras Jan 22 '16 at 01:40
  • Sure, you can change the array declaration. I was just trying to max out compiler auto-vectorization. But whatever compiler you are using does not support C99, and you should ask for a refund. – Jeff Hammond Jan 22 '16 at 02:04