MPI - Add/remove node while program is running

Question

Is there an MPI implementation that allows nodes to be dynamically added/removed at runtime? Do any recover from complete hardware failure of a node, allowing the node to be repaired and relaunched without restarting the program?

Wesley Bland · Accepted Answer · 2014-11-12T13:10:15.817

Is there an MPI implementation that allows nodes to be dynamically added/removed at runtime?

This is actually two questions.

Nodes can usually be added dynamically at runtime using calls like MPI_Comm_spawn. As @Hristo pointed out in the comments, in Open MPI you need to set the correct info key for this. It may also be possible in other implementations.

As for removing nodes, that's a big area of research at the moment. MPI implementations currently have varying levels of success surviving a total node failure. In the current releases of Open MPI, I don't believe there is any support for that sort of failure, though there is ongoing work to improve that. In the current version of MPICH, you can pass the flag -disable-auto-cleanup to mpiexec and it will not automatically clean up your application after a process/node failure. However, you'll still have to modify your MPI application to handle this situation. As far as I know, none of the derivatives of MPICH (Intel MPI, Cray MPI, IBM MPI, MVAPICH, etc.) support this feature.

There are also research implementations that extend the fault tolerance support of the MPI Standard. User Level Failure Mitigation (ULFM) is currently being considered by the standardization body as a way of letting the user handle process failures. There is a research implementation based on Open MPI available at the website linked, and an experimental prototype will also be in the next version of MPICH (3.2).

Do any recover from complete hardware failure of a node, allowing the node to be repaired and relaunched without restarting the program?

This is essentially the same process as above. You would need to use the APIs to remove the failed process, then somehow find out that the repaired node is available again and add it back using MPI_Comm_spawn. These calls have to be made from inside the application, though, not externally.

Open MPI allows new nodes to be added to the host list by setting the `add-host` or the `add-hostfile` property in the `MPI_Info` object passed to `MPI_Comm_spawn`. This feature has been present ever since version 1.5. — Hristo Iliev, Nov 12 '14 at 11:32
