This isn't a question about a specific cluster environment, but rather about the general case of distributing software across multiple nodes of a cluster.
I understand that most HPC clusters use some kind of workload manager to distribute jobs across multiple nodes. From my limited research, Slurm seems to be a popular choice, but others are also in use.
I can see how this is useful if you want to run n independent tasks. But what if you wanted to run tasks that communicate with one another?
If I were developing an application split across two or more machines, I could just design a simple protocol (or use an existing one) and send/receive messages over something like TCP/IP. If things got really complicated, it wouldn't be too hard to design a simple message bus or hub to accommodate more than two machines.
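For example, outside a cluster I'd probably start with something along these lines (a minimal sketch; the host, port, and message format are placeholders I'd choose myself):

```python
# Minimal point-to-point sketch over plain TCP (host/port are placeholders).
import socket

HOST, PORT = "192.168.0.10", 5000  # I'd know the peer's address in advance

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(1024)        # receive one message
            conn.sendall(b"ack:" + data)  # reply with an acknowledgement

def client():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"hello from machine A")
        print(cli.recv(1024))
```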
Firstly, in an HPC cluster is it sensible to use TCP, or is it generally avoided for performance reasons?
Secondly, in a non-cluster environment I know the IP addresses of the machines involved beforehand, but on a cluster the decision about which physical machines my software runs on is delegated to a workload manager like Slurm. So how do I "wire up" the nodes? How does MPI achieve this? Does it even use TCP/IP for communication between nodes?
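To make the second question concrete, my mental model of an MPI program is roughly the sketch below (using mpi4py just for brevity); what I don't see is how the two ranks find each other when Slurm, not me, picks the nodes:

```python
# Sketch of how I picture an MPI program (mpi4py), launched with
# something like: srun -n 2 python script.py
# Note that I never specify hostnames or IP addresses anywhere.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send("hello from rank 0", dest=1, tag=0)
elif rank == 1:
    print(comm.recv(source=0, tag=0))
```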
Sorry if this question is a little open-ended for StackOverflow; I'm happy to move it somewhere else if there's a more appropriate place to ask questions like these.