I'm working with a SLURM-driven HPC cluster consisting of 1 control node and 34 computation nodes. Since the current system is not very stable, I'm looking for guidelines or best practices on how to build such a cluster so that it becomes more stable and secure. To be clear, I'm not looking for detailed answers about resource management or additional tools, but for advice about the very basic setup (see "Question" below).
My current Setup
1 Control Node
This machine has slurm installed on /usr/local/slurm and runs the slurmctld daemon. The complete slurm directory (including all the executables and the slurm.conf) is exported.
34 Computation Nodes
These machines mount the exported slurm directory from the control node to /usr/local/slurm and run the slurmd daemon.
I don't use a backup control node.
If our control node goes down, it always seems to be a matter of luck whether a currently running job survives, so I'm looking for a way to create a more stable setup.
Possible issues with the current setup
1) The shared slurm directory. I couldn't find anything on the net about whether this is actually good or bad practice, but since the slurm config file has to be the same on all machines, I thought I might as well share the complete slurm installation. But of course, if the control node goes down, all of these files become unavailable too.
2) The missing backup control node. This requires a shared NFS directory where the current state can be saved. The question is, where should this directory be located? Of course it doesn't make sense to put it on the control node, but should it be on the backup control node? Or on an entirely different machine?
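To make point 2 concrete, this is roughly the slurm.conf fragment I have in mind for a backup controller. It is only a sketch: the hostnames ctl-primary and ctl-backup and the path /shared/slurm/state are placeholders I made up, and I'm assuming a Slurm version that supports SlurmctldHost (older releases use ControlMachine and BackupController instead).

```
# slurm.conf (must be identical on all nodes)
# Hostnames and path below are placeholders, not my real configuration.

# The first SlurmctldHost entry is the primary controller,
# any further entries are backups in order of preference.
SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup

# Both controllers need read/write access to this directory,
# which is why its location is the open question.
StateSaveLocation=/shared/slurm/state
```

The open question is then which machine should actually host whatever is mounted at StateSaveLocation.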
Question
So, are there guidelines to follow when building an HPC cluster? Specifically: what different kinds of nodes are involved, what is the job of each, what kind of data should be shared via NFS, and where should those shared directories live? I would also be grateful for any literature or tutorials that point me in the right direction.