
I'm designing a head node whose primary function is to submit jobs to the Torque/Maui scheduler and whose secondary function is to run test jobs. Unfortunately, most hardware selection guides for clusters were written in 2000-2004 and are mostly irrelevant nowadays. I've been able to decide most of the hardware configuration easily (e.g., NICs based on the interconnect), but I don't understand how to choose the HDDs/memory/processors.

  1. HDDs: Since I'm using network storage, am I correct that the size/type (SSD vs spindle) of HDD hardly matters, since these only need to meet the requirements of a typical boot drive?

  2. Memory: Assuming the test jobs are not memory-intensive, is there any performance advantage from having a large amount of memory on the head node? Job scheduling doesn't seem memory-intensive. If not, what's a rule of thumb to use to decide how much memory I need?

  3. Processor: Taking the test jobs out of the equation, are there any advantages to having more cores or higher clock frequencies on the processor? I'd imagine that job scheduling is not computationally intensive and hardly benefits from a faster processor or parallelism.

  4. Redundancy: How do you keep the head node from being a SPOF? By having two or more head nodes? Do I leave the redundant head nodes completely passive (unused)? Otherwise, I imagine it will be extremely messy trying to recover from a dead head node. Is heterogeneity (different hardware specs) acceptable across head nodes? Is there any need for RAID mirroring of the boot drives on the head nodes?

elleciel

1 Answer


Even though this question is from 7 years ago, I feel obligated to answer because I believe it is a good question and still very relevant given the prevalence of HPC these days.

  1. HDD. Seven years ago the answer would have been that HDDs are fine, because your network storage was most likely on 1 Gbit/s Ethernet (1GbE). These days it depends on your network. HDDs give roughly 1.4 Gbit/s (~180 MB/s) sequential write speed, which is fine for 1GbE but a complete waste if your network is InfiniBand (25 Gbit/s on the low end) or even 10GbE (very cheap these days). You probably want a SATA SSD (~4 Gbit/s write speed); an NVMe M.2 SSD (~25 Gbit/s) is preferable only if you have the network bandwidth and a decent budget. Overall, as long as you are not transferring large amounts of data to your compute nodes or network storage (for example, for AI training), you can get by with an HDD on 1GbE.
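
The comparison above is just unit arithmetic: drive throughput in MB/s times 8, divided by 1000, gives Gbit/s to set against the link speed. A small sketch (the throughput figures are rough sequential numbers, assumptions rather than benchmarks):

```shell
#!/bin/sh
# Convert a drive's sequential throughput (MB/s) to Gbit/s so it can be
# compared directly against the network link speed.
to_gbit() { echo $(( $1 * 8 / 1000 )); }

echo "HDD      (~200 MB/s)  = ~$(to_gbit 200) Gbit/s"
echo "SATA SSD (~550 MB/s)  = ~$(to_gbit 550) Gbit/s"
echo "NVMe SSD (~3000 MB/s) = ~$(to_gbit 3000) Gbit/s"
echo "Links: 1GbE = 1 Gbit/s, 10GbE = 10 Gbit/s, low-end InfiniBand = 25 Gbit/s"
```

Whichever side of the comparison is smaller is your bottleneck; paying for the faster side is wasted money.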

  2. Memory. Assuming you don't have rogue or inexperienced users who run jobs on the head node, your only consideration is the number of users you expect on the machine at the same time (i.e., simultaneous users, not total users). I would guesstimate at least 8 GB for every 4 simultaneous users. I also assume your memory will be ECC.
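
That rule of thumb is easy to turn into a sizing calculation; the user count below is a made-up example:

```shell
#!/bin/sh
# Size head-node memory at 8 GB per block of 4 simultaneous users,
# rounding the user count up to the next multiple of 4.
simultaneous_users=12   # hypothetical value for illustration
mem_gb=$(( (simultaneous_users + 3) / 4 * 8 ))
echo "Suggested head-node memory: ${mem_gb} GB"
```

So 12 simultaneous users round up to 3 blocks of 4, suggesting 24 GB.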

  3. Processors. No, you are absolutely correct. Unless you are a national research lab or some other site with a large number of users, a 4-core Intel i3 processor with 16 GB of ECC memory makes a cheap, fast, and reliable head node.

  4. RAID, plus regular backups with something like rsync. Having a head node crash is more of an inconvenience than a disaster: assuming user data is on network storage, the head node does not contain any critical data. Head nodes can be rebuilt easily, though that does not mean you should ignore redundancy.

ness