Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding, I think all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to make the code general so that it works in both distributed and shared settings? Also, if I use MPI in a shared-memory setting, would performance decrease compared with OpenMP?
- What is better depends on your future plans for the program. OpenMP is a lot simpler, though. – Fred Foo Jul 04 '12 at 15:38
- As phrased, this question is not constructive; 'better' is far too subjective for this to get, by SO's standards, good answers. – High Performance Mark Jul 04 '12 at 15:41
4 Answers
Whether you need or want MPI or OpenMP (or both) heavily depends on the type of application you are running, and whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
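To illustrate that pattern (my own sketch, not part of the original answer), here is a minimal C program in which each MPI process allocates only its slice of a dataset that would not fit on one node; the problem size is made up:

```c
/* Sketch of the one-process-per-node pattern: each rank allocates and
   works on only its own slice of the data. Build with: mpicc slice.c -o slice */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long global_n = 400000000;        /* too big for one node, say */
    const long local_n  = global_n / nprocs;

    /* Each process holds only 1/nprocs of the total data. */
    double *local = malloc(local_n * sizeof *local);
    for (long i = 0; i < local_n; i++)
        local[i] = (double)(rank * local_n + i);

    printf("rank %d of %d owns %ld elements\n", rank, nprocs, local_n);

    free(local);
    MPI_Finalize();
    return 0;
}
```

Launched with, e.g., `mpiexec -n 4 ./slice` (one process per node), the memory footprint per node is a quarter of the total, and no communication happens until the algorithm actually needs it.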
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but more about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
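To make this concrete, here is a minimal sketch (my own illustration, not from the original answer) where a single pragma distributes a loop across all available cores; the array size and contents are arbitrary:

```c
/* Minimal sketch: one pragma parallelizes the loop across all cores.
   Build with e.g.: gcc -fopenmp -O2 saxpy.c -o saxpy */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    /* The only OpenMP-specific line: iterations are split among threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] + y[i];

    printf("y[42] = %f, max threads = %d\n", y[42], omp_get_max_threads());
    return 0;
}
```

Remove the pragma and you have the serial program back, which is exactly why OpenMP is attractive for this use case.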
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty from the general overhead of using OpenMP (i.e. starting up the OpenMP threads, etc.) is larger than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it becomes necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. This is the only way to make full use of your computational resources, but it is also the most difficult to code and to maintain.
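A minimal hybrid sketch (my addition, with made-up work inside the loop): MPI connects the nodes, while OpenMP threads fill the cores within each node.

```c
/* Hybrid sketch: MPI between nodes, OpenMP threads within each node.
   Build with e.g.: mpicc -fopenmp hybrid.c -o hybrid */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    /* FUNNELED: only the main thread will make MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP parallelizes the node-local computation ... */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 1; i <= 1000000; i++)
        local_sum += 1.0 / (double)i;

    /* ... and MPI combines the per-node results across nodes. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Typically you would launch one MPI rank per node and set OMP_NUM_THREADS to the number of cores on that node, so the two levels of parallelism do not compete for the same cores.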
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.

- @Michael Schlottke: Dear Michael, could you please explain to me why the hybrid solution would be faster than the MPI-only one for a use case with two or more nodes, each having 16+ CPUs? What are the drawbacks of using MPI only in this case? Thanks a lot – neil_mccauley Feb 11 '15 at 15:36
- @neil_mccauley From personal experience (and looking at examples from other research groups), most scientific codes use a hybrid approach when trying to fully utilize many-core nodes. Especially with support for hardware threads, it seems to make sense to use thread-level parallelism to a certain degree within a node (or even core). Having extreme numbers of MPI ranks increases communication, makes collective operations more costly and (arguably most importantly) increases memory consumption. Whether it makes sense in your case can only be answered on a per-code, per-machine basis. – Michael Schlottke-Lakemper Feb 13 '15 at 08:43
- @MichaelSchlottke I have a program which does many independent computational tasks. I have already implemented OpenMP loop-level parallelization within each task. However, the speedup is nowhere near the theoretical one and depends heavily on the length of the loop. Memory is not a constraint for me. In my code, communication is only needed once a task is completed, which takes a few minutes to finish. Do you think that an MPI-only solution (distributing the tasks among node cores) would be much more efficient than the hybrid approach for my use case? Thanks a lot! – neil_mccauley Feb 20 '15 at 01:24
- @neil_mccauley: It depends. If your computational tasks are really independent and do not need much communication, then it seems worth trying MPI parallelism. If you only need communication once every couple of minutes, it should scale more or less linearly (perfectly), and you would not have to implement that much, either. However, if you've already done loop-level parallelization with OpenMP, why remove it? Just check whether using both can be even faster (although in your case it does not seem to be that way). – Michael Schlottke-Lakemper Feb 20 '15 at 10:14
- @MichaelSchlottke: My computational tasks are loosely coupled (it is an evolutionary algorithm). The reason I want to remove the fine-grained parallelization with OpenMP is to "save" CPU cores, because it does not scale well at all in my case. I would rather use those cores with MPI instead. I am also thinking about parallelizing the tasks with OpenMP. Would that be better than MPI in a shared-memory environment? – neil_mccauley Feb 20 '15 at 12:24
With most distributed-memory platforms nowadays consisting of SMP or NUMA nodes, it just makes no sense not to use OpenMP. OpenMP and MPI can work together perfectly: OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago, but it is now becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.

For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is that a good implementation of MPI could be clever enough to spot that it's being used in a shared-memory environment and optimise its behaviour accordingly.

Just for the bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology, by using the same address space. Since MPI may have the same data replicated in memory (because processes can't share data), it can waste cache capacity, as the same data then occupies the caches several times over.
On the other hand, if you partition your data correctly and each processor has a private cache, the problem may reach a point where it fits completely in cache. In that case you get superlinear speedups.
Speaking of caches: recent processors have very different cache topologies, so, as always: IT DEPENDS...

- It's worth noting that as of MPI-3, processes can indeed share data. – Patrick Sanan Mar 26 '18 at 06:53
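For reference (my addition, not part of the thread): the MPI-3 feature Patrick mentions is exposed through shared-memory windows. A minimal sketch, with made-up values; `node_comm` is just an illustrative name for the communicator grouping the ranks on one node:

```c
/* Sketch of MPI-3 shared memory: ranks on the same node read each other's
   data through plain pointers. Build with: mpicc shwin.c -o shwin */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Group the ranks that can actually share memory (same node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Each rank contributes one double to a node-local shared window. */
    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node_comm, &mine, &win);
    *mine = (double)node_rank;
    MPI_Win_fence(0, win);      /* make all writes visible */

    if (node_rank > 0) {        /* read the left neighbour's slot directly */
        MPI_Aint size; int disp; double *left;
        MPI_Win_shared_query(win, node_rank - 1, &size, &disp, &left);
        printf("rank %d reads %f from rank %d\n", node_rank, *left,
               node_rank - 1);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```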