
This isn't a question about a specific cluster environment, but rather about the general case of distributing software across multiple nodes on a cluster.

I understand that most HPC clusters use some kind of workload manager to distribute jobs to multiple nodes. From my limited research Slurm seems to be a popular choice but others are also in use.

I can see how this is useful if you want to run n independent tasks. But what if you wanted to run tasks that communicate with one another?

If I were developing an application that was split across two or more machines I could just design a simple protocol (or use an existing one) and send/receive messages over something like TCP/IP. If things got really complicated it wouldn't be too hard to design a simple message bus or message hub to accommodate more than two machines.

Firstly, in an HPC cluster is it sensible to use TCP, or is this generally not used for performance reasons?

Secondly, in a non-cluster environment I know beforehand the IP addresses of the machines involved, but on a cluster, I delegate the decision of which physical machines my software is deployed on to a workload manager like Slurm. So how can I "wire up" the nodes? How does MPI achieve this, or is it not using TCP/IP to allow communication between nodes?

Sorry if this question is a little open-ended for Stack Overflow; I'm happy to move it somewhere else if there's a more appropriate place to ask questions like these.

Oli

1 Answer


If I were developing an application that was split across two or more machines I could just design a simple protocol (or use an existing one) and send/receive messages over something like TCP/IP

And so MPI came along, so that not everyone has to reinvent the wheel (and that wheel represents several thousand hours of engineering time; it is not your basic chariot wheel, it has been over some very bumpy roads...).
But ultimately that is what MPI does under the hood (for the case where you want your communications to go over TCP, see OpenMPI TCP).
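For illustration, here is a minimal sketch (my own, not from the question) of what that looks like from the application side, assuming any MPI implementation and a compiler wrapper such as `mpicc`. The transport underneath, TCP or otherwise, is picked by the MPI library, not by this code:

```c
/* Minimal MPI point-to-point sketch: rank 0 sends one integer to rank 1.
 * Build with: mpicc hello_msg.c -o hello_msg
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* the runtime wires the processes up */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there? */

    if (size >= 2) {
        if (rank == 0) {
            int payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload;
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", payload);
        }
    }

    MPI_Finalize();
    return 0;
}
```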

Firstly, in an HPC cluster is it sensible to use TCP, or is this generally not used for performance reasons?

There are other means of communication than TCP (shared memory, Myrinet, OpenFabrics, ...; see the OpenMPI FAQ). In HPC there are a few interconnect solutions on the market (look at the Top 500).

So how can I "wire up" the nodes? How does MPI achieve this, or is it not using TCP/IP to allow communication between nodes?

The wiring is managed by the workload manager (you can have a look at the slurm configuration or loadleveler). MPI just "inherits" that context, because in an HPC context you typically stop invoking mpirun yourself and launch with srun or runjob instead (rather than doing something like Specify the machines running program using MPI).
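To make the "inherit" part concrete, here is a minimal sketch (again my own illustration, not from the answer): the program never mentions a hostname or an IP address; it only asks the runtime who it is and where it landed, and the launcher decides the placement, e.g. `srun -N 2 -n 4 ./where_am_i` inside a Slurm allocation, or `mpirun -n 4 ./where_am_i` elsewhere:

```c
/* Minimal sketch: each MPI rank reports which node the scheduler placed it on. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);   /* which node did the launcher put me on? */

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```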

PilouPili
  • Thanks for the response. I've looked at the docs you linked to and read about running jobs in Slurm specifically. One way I could imagine "wiring up" nodes is by using `srun` with the `-w` flag to specify which nodes to run an executable on. That way I can pass the node list (as a simple array of hostnames) as a parameter to each executable's `main` function. @NPE Is that the kind of thing you were getting at when you said the workload manager does the "wiring up"? Or is there a more sophisticated way of doing this? (I'm talking about making my own comms using TCP btw, not using MPI) – Oli Oct 13 '18 at 23:51
  • No, there is no more sophisticated way. When people communicate they have to know each other's addresses (unless they are in the same room… multicore); the same goes for machines. – PilouPili Oct 14 '18 at 00:00
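For completeness, a minimal sketch of the approach discussed in these comments (hand-rolled TCP rather than MPI), assuming the tasks are launched with `srun` inside a Slurm allocation. `SLURM_JOB_NODELIST` and `SLURM_PROCID` are standard Slurm environment variables; the actual socket plumbing is left out:

```c
/* Sketch: discover peers from Slurm's environment instead of hard-coding IPs. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Compressed host list, e.g. "node[01-04]"; expand it with
     * `scontrol show hostnames "$SLURM_JOB_NODELIST"` or a hostlist library. */
    const char *nodelist = getenv("SLURM_JOB_NODELIST");
    const char *procid   = getenv("SLURM_PROCID");   /* this task's rank in the job step */

    if (!nodelist || !procid) {
        fprintf(stderr, "not running under srun?\n");
        return 1;
    }

    printf("task %s sees node list %s\n", procid, nodelist);

    /* From here, each task could resolve the hostnames and open TCP
     * connections to its peers using ordinary sockets. */
    return 0;
}
```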