C/C++ Framework for distributed computing in a dynamic cluster

Question

I am looking for a framework to be used in a C++ distributed number crunching application.

The setup looks as follows:

There is a master node which divides the problem domain into small independent tasks. The tasks are distributed to worker nodes of different capability (e.g. CPU type/GPU-enabled). Worker nodes are dynamically added to the compute grid as they become available. It may also happen that a worker node dies, without saying good bye.

I am searching for a fast C/C++ framework to accomplish this setup.

To summarize, my main requirements are:

Worker/Task-scheduling paradigm
Dynamically add/remove nodes
Target network: 1-10 Gbit/s Ethernet (corporate network; good performance over the internet is not required)
Optional: Encrypted and authenticated communication

I know there are companies that rent out the computing power of multiple PCs. – Kirill Kobelev, Jul 12 '12 at 08:26

High Performance Mark · Answer 1 · 2012-07-12T10:40:32.200

You can certainly do what you want with MPI. MPI-2 added dynamic process management features, and I think most of the currently widely-used implementations offer these.

One of the advantages of using C++ + MPI is that the combination is quite widely used in scientific and technical computing, though my impression is that within this niche dynamic process management is not used very much. Since MPI is used on the very largest supercomputers tackling the bleeding-edge problems of computational science, one might hazard a guess that it would be fast enough for your purposes.

One of the disadvantages of using C++ + MPI is that MPI was not designed to tolerate failure of processes during execution. There is debate on SO about whether the dynamic process management features allow you to program your own fault tolerance, but there is no debate that doing so might be difficult.
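To give an idea of what "programming your own fault tolerance" involves on the master side, here is a minimal sketch of the bookkeeping, independent of the transport used. It is not MPI code and not a complete scheduler: the class `TaskBoard` and all of its method names are hypothetical, and it assumes the master learns about a dead worker by some other means (for example a missed heartbeat or a broken connection).

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <set>
#include <string>

// Hypothetical master-side bookkeeping; every name here is illustrative.
class TaskBoard {
public:
    // Queue a new independent task.
    void addTask(int taskId) { pending_.push_back(taskId); }

    // A worker may join the grid at any time.
    void addWorker(const std::string& worker) { inFlight_[worker]; }

    // Hand the next pending task to a known worker.
    // Returns false if the worker is unknown or no task is pending.
    bool assign(const std::string& worker, int& taskId) {
        auto it = inFlight_.find(worker);
        if (it == inFlight_.end() || pending_.empty()) return false;
        taskId = pending_.front();
        pending_.pop_front();
        it->second.insert(taskId);
        return true;
    }

    // The worker delivered a result, so the task is finished.
    void complete(const std::string& worker, int taskId) {
        auto it = inFlight_.find(worker);
        if (it != inFlight_.end() && it->second.erase(taskId) > 0) ++done_;
    }

    // The worker vanished without notice: requeue everything it held.
    void workerLost(const std::string& worker) {
        auto it = inFlight_.find(worker);
        if (it == inFlight_.end()) return;
        for (int taskId : it->second) pending_.push_front(taskId);
        inFlight_.erase(it);
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t done() const { return done_; }

private:
    std::deque<int> pending_;                        // tasks not yet handed out
    std::map<std::string, std::set<int>> inFlight_;  // worker -> tasks it holds
    std::size_t done_ = 0;                           // number of completed tasks
};
```

Because the tasks are independent, requeueing a lost worker's tasks is enough here. The hard parts that this sketch leaves out are detecting the failure reliably and keeping the communication layer itself alive when a peer disappears.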

You would get the first 3 of your requirements 'out-of-the-box'. As for encrypted and authenticated communication, you'd have to do most of that yourself; MPI just passes messages around. I'd guess that for most MPI users, running parallel applications on clusters or supercomputers with private interconnects (often themselves isolated from corporate or enterprise networks), encryption and authentication are matters of little concern.

edited Jul 12 '12 at 10:40

answered Jul 12 '12 at 09:09



"It may also happen that a worker node dies, without saying good bye." And your MPI job says bye-bye, as MPI-2 has no fault tolerance support whatsoever... – Hristo Iliev Jul 12 '12 at 09:13
If I run `mpiexec` on a server and run it separately on a client, I get two instances of `MPI_COMM_WORLD` which can communicate with each other using `MPI_Open_port` (etc). If the client dies, only the communicators which have processes on the client die with it. Alternatively, the server could call `MPI_Close_port` after dispatching the job to the client and both proceed independently. Or am I mistaken? And note, I'm not stating that this is any kind of 'best' way of satisfying OP's requirements, just a way. – High Performance Mark Jul 12 '12 at 09:40
I don't think `MPI_Open_port` and `MPI_Comm_connect` work across MPI universes, and I'm not sure that you can extend an MPI universe with new nodes once it has been created. `MPI_Comm_spawn` might allow you to specify hostnames in the info argument, but that would be highly non-portable. Besides, he says that the tasks are independent, and I think that BOINC provides the required mature environment with failure resistance, task rescheduling, client classes and so on. – Hristo Iliev Jul 12 '12 at 10:32
Thanks for the comment. In fact I was looking at MPI for this project, but due to its missing "native" support for fault tolerance it was not my first choice. However, I totally agree with you that using MPI certainly has advantages regarding the community behind it and that it has proved to be good ... – Erik Jul 12 '12 at 10:56
