
An assignment that I've just now completed requires me to create a set of scripts that can configure random Ubuntu machines as nodes in an MPI computing cluster. This has all been done and the nodes can communicate with one another properly. However, I would now like to demonstrate the efficiency of said MPI cluster by throwing a parallel program at it. I'm just looking for a straight brute force calculation that can divide up work among the number of processes (=nodes) available: if one node takes 10 seconds to run the program, 4 nodes should only take about 2.5.

With that in mind I looked for prime calculation programs written in C. For any purists: the program is not actually part of my assignment, as the course I'm taking is purely systems management. I just need anything that will show that my cluster is working. I have some programming experience but little in C and none with MPI. I've found quite a few sample programs, but none of them seem to actually run in parallel. They do distribute all the steps among my nodes, so if one node has a faster processor the overall time goes down, but adding additional nodes does nothing to speed up the calculation.

Am I doing something wrong? Are the programs that I've found simply not parallel? Do I need to learn C programming for MPI to write my own program? Are there any other parallel MPI programs that I can use to demonstrate my cluster at work?

EDIT

Thanks to the answers below I've managed to get several MPI scripts working, including the sum of the first N natural numbers (which isn't very useful as it quickly runs into data type limits), the counting and generating of prime numbers and the Monte Carlo calculation of Pi. Interestingly, only the prime number programs realise a (sometimes dramatic) performance gain with multiple nodes/processes.

The issue that caused most of my initial problems with getting scripts working was rather obscure and apparently due to issues with the hosts files on the nodes. Running mpiexec with the -disable-hostname-propagation parameter solved this problem, which may manifest itself in a variety of ways: MPI_Barrier errors, TCP connect errors and other generic connection failures. I believe it may be necessary for all nodes in the cluster to know one another by hostname, which is not really an issue in classic Beowulf clusters that have DHCP/DNS running on the server node.
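
For anyone running into the same thing, the parameter just goes on the mpiexec command line, roughly like this (an example assuming MPICH's Hydra process manager; the machinefile, process count and program name are placeholders for your own setup):

    mpiexec -disable-hostname-propagation -f machinefile -n 4 ./my_mpi_program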

Lilienthal

2 Answers


The usual proof of concept application in parallel programming is simple raytracing.

That being said, I don't think that raytracing is a good example to show off the power of Open MPI. I'd put the emphasis on scatter/gather, or even better scatter/reduce, because that's where MPI shows its true power :)

The most basic example for that would be calculating the sum of the first N integers. You'll need a master process that puts the value ranges to sum over into an array and scatters these ranges across the workers.

Then you'll need to do a reduction and check your result against the closed-form formula N(N+1)/2, to get a free validation test.
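
A minimal sketch of what that could look like in C (an illustration, not polished code: it assumes N is divisible by the number of processes and uses MPI_Scatter/MPI_Reduce as described):

    /* sum_mpi.c -- sum of 1..N split across MPI processes.
     * Compile: mpicc sum_mpi.c -o sum_mpi
     * Run:     mpiexec -n 4 ./sum_mpi
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const long long N = 100000000LL;   /* sum 1..N */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long long chunk = N / size;        /* assumes N % size == 0 for simplicity */
        long long *starts = NULL;

        if (rank == 0) {                   /* master fills the ranges to scatter */
            starts = malloc(size * sizeof(long long));
            for (int i = 0; i < size; i++)
                starts[i] = (long long)i * chunk + 1;
        }

        /* Every rank (including the master) receives the start of its range. */
        long long my_start = 0;
        MPI_Scatter(starts, 1, MPI_LONG_LONG,
                    &my_start, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);

        long long local = 0;
        for (long long i = my_start; i < my_start + chunk; i++)
            local += i;

        /* Combine the partial sums on rank 0. */
        long long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("sum = %lld, closed form = %lld\n", total, N * (N + 1) / 2);
            free(starts);
        }

        MPI_Finalize();
        return 0;
    }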

If you're looking for a weak spot of MPI, a parallel grep might work, where I/O is the bottleneck.


EDIT

You'll have to keep in mind that MPI is based on a shared-nothing architecture where the nodes communicate using messages, and that the number of nodes is fixed. These two factors set a very tight frame for the programs that run on it. To make a long story short, this kind of parallelism is great for data-parallel applications, but sucks for task-parallel applications, because you can usually distribute data better than tasks if the number of nodes changes.

Also, MPI has no concept of implicit work-stealing. If a node is finished working, it just sits around waiting for the other nodes to finish. That means you'll have to figure out weakest-link handling yourself.

MPI is very customizable when it comes to performance details; there are numerous different variants of MPI_SEND, for example. That leaves much room for performance tweaking, which is important for the high-performance computing MPI was designed for, but mostly confuses "ordinary" programmers, leading to programs that actually get slower when run in parallel. Maybe your examples just suck :)

And on the scaleup / speedup problem, well...

I suggest that you read up on Amdahl's Law, and you'll see that it's impossible to get linear speedup just by adding more nodes :)
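
To put a formula on it: if p is the fraction of the program that can run in parallel and n is the number of processes, Amdahl's Law bounds the speedup to

    S(n) = 1 / ((1 - p) + p / n)

so even with p = 0.95 the speedup can never exceed 20, no matter how many nodes you throw at it.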

I hope that helped. If you still have questions, feel free to drop a comment :)


EDIT2

Maybe the best-scaling problem that integrates perfectly with MPI is the empirical estimation of Pi.

Imagine a quarter circle with radius 1 inside a square with sides of length 1. You can then estimate Pi by firing random points into the square and calculating whether they land inside the quarter circle.

Note: this is equivalent to generating tuples (x, y) with x, y in [0, 1] and counting how many of them satisfy x² + y² <= 1.

Pi is then roughly equal to

4 * Points in Circle / total Points

In MPI you'd just have to gather the hit counts (or ratios) from all processes, which is very little overhead and thus makes a perfect proof-of-concept problem for your cluster.
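
A minimal C sketch of that idea (illustrative only; each rank seeds its own generator and uses rand_r, which is POSIX, and the hit counts are combined with MPI_Reduce):

    /* pi_mc.c -- Monte Carlo estimate of Pi with MPI.
     * Compile: mpicc pi_mc.c -o pi_mc
     * Run:     mpiexec -n 4 ./pi_mc
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const long long samples = 10000000LL;   /* points per process */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Give every rank a different random sequence. */
        unsigned int seed = 12345u + 1000u * (unsigned int)rank;

        long long hits = 0;
        for (long long i = 0; i < samples; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)   /* inside the quarter circle? */
                hits++;
        }

        /* Sum the hit counts from all ranks onto rank 0. */
        long long total_hits = 0;
        MPI_Reduce(&hits, &total_hits, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double pi = 4.0 * (double)total_hits / (double)(samples * size);
            printf("pi ~= %.6f (using %lld points)\n", pi, samples * size);
        }

        MPI_Finalize();
        return 0;
    }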

Andreas Grapentin
  • A very interesting reply. When you refer to the number of nodes as fixed, do you mean that it has to be known and remain stable while an operation is running? Edit (posted by mistake on tabchange): I'm also aware of the diminishing returns in real-world scenarios, but I thought that for simple arithmetic calculations, such as the sum operation you suggest, where work can be fully divided and can run fully in parallel, capacity to performance would rise in nearly equal measure. – Lilienthal Mar 08 '13 at 19:13
  • @Lilienthal you're right, the word "fixed" is a little ambiguous there. what I wanted to say was that the MPI work distributer spawns the same program on every node, and that no new additional copies can join the party while it's running. – Andreas Grapentin Mar 08 '13 at 19:20
  • @Lilienthal you've always got the overhead of distributing the work to the nodes which sometimes cancels the performance gain for few nodes. and for many nodes, the diminishing returns grow exponentially. Also, there's the problem of measuring sequential performance :) but yeah.. simple calculations that take a measurable amount of time should be sped up nearly linearly by the parallelization, as long as the number of nodes is reasonably small. – Andreas Grapentin Mar 08 '13 at 19:26
  • I see. As for the sum algorithm, would this be about right for p=4, p being the number of processes? `sum(1..n) = sum(1..n/p) + sum((n/p+1)..2n/p) + sum((2n/p+1)..3n/p) + sum((3n/p+1)..n)`. Obviously it needs some work but I think I should be able to program this in C. – Lilienthal Mar 08 '13 at 19:30
  • @Lilienthal yes, this looks about right. have a look at the MPI_SCATTER and MPI_REDUCE functions to distribute the work over the nodes, and you should be set. good luck :) – Andreas Grapentin Mar 08 '13 at 19:43
  • @Lilienthal I just remembered another great mpi application. I'll update my answer in a minute. – Andreas Grapentin Mar 08 '13 at 19:52
  • I've updated my original post as well. Thank you for bringing up what is apparently known as the Monte Carlo method of π calculation. I'll definitely look into getting that working as it would be more interesting to display. For now I'll make do with the sum program I've posted above but I'll look into expanding or refining the code once the system framework (i.e. the part I'm actually scored on) is up to spec. – Lilienthal Mar 08 '13 at 20:18
  • MPI is perfectly able to dynamically spawn additional processes when needed. It also provides fairly complicated mechanisms to support client/server applications, i.e. new processes can join the universe without being part of the initial launch or being spawned via the MPI process control facilities. – Hristo Iliev Mar 09 '13 at 08:37
  • Thank you for the correction Hristo. @AndreasGrapentin Thanks in large part to your suggestions I've managed to get a couple of nice demonstration scripts working, in particular the Monte Carlo algorithm you've suggested. Much obliged. – Lilienthal Mar 11 '13 at 20:29

As with any other computing paradigm, there are certain well-established patterns in use with distributed-memory programming. One such pattern is the "bag of jobs" or "controller/worker" (previously known as "master/slave", but that name is now considered politically incorrect). It is best suited for your case because:

  • under the right conditions it scales with the number of workers;
  • it is easy to implement;
  • it has built-in load balancing.

The basic premise is very simple. The "controller" process has a big table/queue of jobs and practically executes one big loop (possibly an infinite one). It listens for messages from the "worker" processes and responds. In the simplest case workers send only two types of messages: job requests or computed results. Correspondingly, the controller process sends two types of messages: job descriptions or termination requests.
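
A bare-bones sketch of that message flow could look like this (illustrative only; the tags, the integer "job description" and the dummy work function are placeholders):

    /* Controller/worker skeleton: the controller hands out job IDs on request
     * and a job ID of -1 tells a worker to stop. Results are not aggregated
     * here; a real program would store or reduce them on the controller. */
    #include <mpi.h>
    #include <stdio.h>

    #define TAG_REQUEST 1   /* worker -> controller: work request / result */
    #define TAG_JOB     2   /* controller -> worker: job description or stop */
    #define NUM_JOBS    100

    static double do_job(int job)      /* placeholder for the real work */
    {
        return (double)job * job;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {               /* controller */
            int next_job = 0, stopped = 0;
            while (stopped < size - 1) {
                double result;
                MPI_Status st;
                /* Any message from a worker doubles as a request for work. */
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, &st);
                int job = (next_job < NUM_JOBS) ? next_job++ : -1;
                if (job < 0)
                    stopped++;
                MPI_Send(&job, 1, MPI_INT, st.MPI_SOURCE, TAG_JOB, MPI_COMM_WORLD);
            }
        } else {                       /* worker */
            double result = 0.0;       /* dummy first "result" = work request */
            for (;;) {
                int job;
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_REQUEST, MPI_COMM_WORLD);
                MPI_Recv(&job, 1, MPI_INT, 0, TAG_JOB, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (job < 0)
                    break;             /* termination request */
                result = do_job(job);
            }
        }

        MPI_Finalize();
        return 0;
    }

Because a worker asks for new work as soon as it finishes, faster nodes automatically end up doing more jobs, which is where the built-in load balancing comes from.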

And the canonical non-trivial example of this pattern is colouring the Mandelbrot set. Computing each pixel of the final image is done completely independently of the other pixels, so it scales very well even on clusters with high-latency, slow network connections (e.g. GigE). In the extreme case each worker could compute a single pixel, but that would result in very high communication overhead, so it is better to split the image into small rectangles. One can find many ready-made MPI codes that colour the Mandelbrot set. For example, this code uses row decomposition, i.e. a single job item is to fill one row of the final image. If the number of MPI processes is big, the image dimensions have to be fairly large, otherwise the load won't balance well enough.
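
With row decomposition, the per-job computation a worker performs might look roughly like this (a sketch with arbitrary image dimensions and iteration limit; in the bag-of-jobs scheme above, the job description would simply be the row index and the filled array would be sent back as the result):

    #include <stdio.h>

    #define WIDTH    1024
    #define HEIGHT    768
    #define MAX_ITER  256

    /* Fill one row of the image: for each pixel, iterate z = z*z + c and
     * record the iteration count at which |z| escapes (or MAX_ITER). */
    static void compute_row(int row, int counts[WIDTH])
    {
        double ci = -1.5 + 3.0 * row / HEIGHT;        /* imaginary part of c */
        for (int col = 0; col < WIDTH; col++) {
            double cr = -2.0 + 3.0 * col / WIDTH;     /* real part of c */
            double zr = 0.0, zi = 0.0;
            int iter = 0;
            while (iter < MAX_ITER && zr * zr + zi * zi <= 4.0) {
                double tmp = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = tmp;
                iter++;
            }
            counts[col] = iter;      /* this value picks the pixel's colour */
        }
    }

    int main(void)
    {
        int row[WIDTH];
        compute_row(HEIGHT / 2, row);                 /* demo: middle row */
        printf("iterations at centre pixel: %d\n", row[WIDTH / 2]);
        return 0;
    }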

MPI also has mechanisms that allow spawning additional processes or attaching externally started jobs in client/server fashion. Implementing them is not rocket science, but still requires some understanding of advanced MPI concepts like intercommunicators, so I would skip that for now.

Hristo Iliev
  • Thank you for your comment. Mandelbrot generation was interesting but I decided against using it as it doesn't lend itself well to displaying the output in Ubuntu Server's terminal. And thanks for clarifying these elements of MPI. They may be too advanced to implement for now but they are interesting to read and learn about nonetheless. – Lilienthal Mar 11 '13 at 20:20