A short sketch:
The most important part will be planning your networks and the physical installation. You'll probably use at least three or four networks: one for management, where you provision, monitor, deploy, and configure your nodes; one for storage access; another for MPI and other internode communication; and an out-of-band network where your IPMI ports sit. Prepare a concept for scaling these networks physically and logically, and in terms of capacity and performance.
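As a toy illustration of the logical side of that planning, here is a minimal sketch that lays out the four networks with Python's ipaddress module; the subnet ranges and names are purely hypothetical assumptions, not a recommendation:

    import ipaddress

    # Hypothetical address plan for the four networks mentioned above.
    # Size the prefixes for the node count you expect after growth.
    networks = {
        "management":   "10.10.0.0/22",   # provisioning, monitoring, config mgmt
        "storage":      "10.20.0.0/22",   # NFS / parallel FS traffic
        "interconnect": "10.30.0.0/22",   # MPI / internode communication
        "oob_ipmi":     "10.40.0.0/22",   # out-of-band BMC/IPMI ports
    }

    for name, cidr in networks.items():
        net = ipaddress.ip_network(cidr)
        print(f"{name:13s} {net}  ({net.num_addresses - 2} usable hosts)")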
You didn't say whether you want to run MPI applications; if so, you'll probably want some kind of fast interconnect between your nodes. Something like InfiniBand or 10G/40G Ethernet is a huge advantage for anything that uses MPI, not only in terms of bandwidth (current-gen InfiniBand offers 56 Gbit/s) but also because MPI on top of RDMA frees your node CPUs from networking tasks. IB is also very nice in terms of latency, with real-world latencies on the order of a microsecond.
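If you want a quick feel for what latency the interconnect actually delivers once MPI is up, a tiny ping-pong like the following works; this is just a sketch and assumes mpi4py is installed on the nodes:

    # Minimal MPI ping-pong to get a rough feel for point-to-point latency.
    # Run with e.g.: mpirun -n 2 python3 pingpong.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    iters = 10000
    msg = b"x"                      # 1-byte payload, so timing is latency-dominated

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.send(msg, dest=1, tag=0)
            comm.recv(source=1, tag=0)
        elif rank == 1:
            comm.recv(source=0, tag=0)
            comm.send(msg, dest=0, tag=0)
    t1 = MPI.Wtime()

    if rank == 0:
        # Each iteration is one round trip; report one-way latency.
        print(f"~{(t1 - t0) / iters / 2 * 1e6:.2f} us one-way")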
There are two basic cluster topologies out there. In the first, the internal compute network and the compute nodes are not directly visible on your regular corporate network, and users have a defined entry point into the system. Usually this is implemented as some kind of SSH load-balanced login machines: users log into these and submit jobs to the compute nodes from there.
The other approach is to have your compute nodes visible from your corporate network, so users can submit from any of them to any other compute node. This setup has the drawback that your compute nodes need some kind of security management of their own, as they are not inside a protected network. You probably don't want this setup unless your workload requires it, and even then you will probably want some kind of firewalling/routing setup.
You will need an OS deployment/setup infrastructure. Depending on your OS and what your nodes will look like, there are multiple options. An easy one would be IBM's xCAT. You could also go the DIY route and deploy your nodes with something like Foreman and manage the configuration via Puppet. Larger deployments have dedicated provisioning and configuration (management) nodes.
And just BTW, Rocks is dead; don't put your eggs into that basket.
These management nodes will need to provide DHCP, DNS, and routing/bridging between networks, mirror upstream software repositories, maintain deployment images, possibly offer boot services for compute nodes, etc. The network is a very important part of the setup and needs careful planning!
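To make the kind of glue such a management node ends up running a bit more concrete, here is a sketch that generates DHCP reservations (in dnsmasq's dhcp-host syntax) from a node inventory; the MAC addresses, hostnames, and output path are made-up assumptions:

    # Generate dnsmasq DHCP reservations for compute nodes from a tiny inventory.
    # In a real setup this inventory would live in your provisioning tool
    # (xCAT, Foreman, ...) rather than in a Python dict.
    inventory = {
        # hostname: (MAC address, management-network IP) -- all hypothetical
        "node001": ("52:54:00:aa:00:01", "10.10.1.1"),
        "node002": ("52:54:00:aa:00:02", "10.10.1.2"),
    }

    lines = [f"dhcp-host={mac},{host},{ip}"
             for host, (mac, ip) in sorted(inventory.items())]

    # Written to a drop-in directory dnsmasq is assumed to be configured to include.
    with open("/etc/dnsmasq.d/compute-nodes.conf", "w") as f:
        f.write("\n".join(lines) + "\n")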
These nodes themselves form some kind of "clustered setup" (i.e. it's common to have certain services HA-protected) to enable reliable and fast reinstalls of your compute nodes, to monitor operations, and to keep them unaffected by the load on your compute nodes. It is important to protect these nodes from users or any other non-dedicated load.
It's common practice to design your service/management nodes in a hierarchical fashion, i.e. you might have an HA cluster of two nodes that deploys the next level of management nodes (HA or not), and each of those is only responsible for deploying and managing one or two racks of machines.
You will need some kind of shared file system that is visible to your compute portion. As your setup seems to be very small, a good route is NFS: export it from your server onto your nodes and put that into your configuration management (see the exports sketch below). It's common practice to use several file systems for different purposes (home directories, scratch/working directories, shared project/group directories, a software directory, etc.). Since these have very different requirements in terms of performance and capacity, you'll need to figure out some kind of storage strategy for managing them.
Note that scalable setups don't use NFS for more than home directories and software; for large IO requirements people use things like Lustre, GPFS, PanFS, and many others.
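If you do go the NFS route, the server-side part is basically an /etc/exports file; a minimal sketch of templating it from your configuration management data could look like this (the paths, subnet, and export options are assumptions to adapt to your setup):

    # Render an NFS exports fragment for the shared file systems mentioned above.
    # Subnet and mount options are hypothetical; tune them (sync/async,
    # root_squash, ...) to your needs and re-export after updating the file.
    COMPUTE_NET = "10.10.0.0/22"            # network the compute nodes mount from
    exports = {
        "/export/home":     "rw,sync,no_subtree_check",
        "/export/scratch":  "rw,async,no_subtree_check",
        "/export/projects": "rw,sync,no_subtree_check",
        "/export/software": "ro,sync,no_subtree_check",
    }

    # exports.d is assumed to be honoured by your nfs-utils version.
    with open("/etc/exports.d/cluster.exports", "w") as f:
        for path, opts in exports.items():
            f.write(f"{path} {COMPUTE_NET}({opts})\n")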
There's also the question of how to manage user accounts and groups. Since you'll have a shared file system, you will need consistent UIDs/GIDs across your cluster. Some kind of central directory (LDAP, for example) is usually the approach here.
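A simple way to catch drift before it bites you on the shared file system is to compare each node's view of the accounts against a canonical UID/GID map; the sketch below assumes a small hard-coded reference mapping, which in practice you'd pull from whatever directory you settle on:

    # Verify that local account UIDs/GIDs match a canonical map, so files on the
    # shared file system end up owned by the same numeric IDs everywhere.
    import grp
    import pwd

    canonical_users = {"alice": 2001, "bob": 2002}      # user -> UID (hypothetical)
    canonical_groups = {"hpcusers": 3001}               # group -> GID (hypothetical)

    problems = []
    for name, uid in canonical_users.items():
        try:
            if pwd.getpwnam(name).pw_uid != uid:
                problems.append(f"user {name}: UID mismatch")
        except KeyError:
            problems.append(f"user {name}: missing on this node")

    for name, gid in canonical_groups.items():
        try:
            if grp.getgrnam(name).gr_gid != gid:
                problems.append(f"group {name}: GID mismatch")
        except KeyError:
            problems.append(f"group {name}: missing on this node")

    print("\n".join(problems) if problems else "UID/GID map consistent on this node")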
After you have a basic OS deployment ready and have your shared storage set up, you must have scheduler software installed so that your users don't step on each other's toes. The scheduler software (actually a distributed resource manager, DRM) is the piece of software that gives you a true compute cluster; without the scheduler this is just a bunch of machines connected to a network. There are many schedulers and resource managers out there, like Torque, SGE/OGE, the marvellous SLURM, or commercial ones like Platform LSF and Altair's PBSPro, et al. The DRM is the one thing responsible for launching/terminating user jobs, monitoring resource usage, maintaining a queue of jobs/tasks, etc. Modern schedulers already know of things like Hadoop and how to start these up and tear them down. The scheduler software will also have some kind of accounting system available that offers you insight into the usage of your machines.
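To make the "users don't step on each other's toes" part concrete: with SLURM, for instance, a user describes the resources a job needs in a batch script and hands it to the scheduler. A minimal sketch (the partition name and resource numbers are made up) could look like this; SLURM reads the #SBATCH comments regardless of the interpreter, so a Python script works fine as the job script:

    #!/usr/bin/env python3
    #SBATCH --job-name=pi-demo
    #SBATCH --partition=batch          # hypothetical partition/queue name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=1G
    #SBATCH --time=00:10:00

    # Toy workload: Monte Carlo estimate of pi on the cores the scheduler granted.
    import os
    import random
    from multiprocessing import Pool

    def hits(n):
        return sum(random.random() ** 2 + random.random() ** 2 <= 1.0 for _ in range(n))

    if __name__ == "__main__":
        cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
        per_worker = 1_000_000
        with Pool(cpus) as pool:
            total = sum(pool.map(hits, [per_worker] * cpus))
        print("pi ~", 4 * total / (per_worker * cpus))

The user submits it with "sbatch pi-demo.py", and the scheduler queues the job until the requested cores and memory are free on some node.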
You will probably want some system for managing self-rolled software that you install on a shared file system, via something like Environment Modules. Modules give you a versioned user environment and let you provide central software to your users.
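Conceptually, loading a module just prepends a versioned install prefix on the shared software file system to the user's environment (the real thing is a Tcl or Lua modulefile evaluated by the shell); a rough Python illustration of the idea, with made-up paths and versions, would be:

    # Illustration only: roughly what "module load openmpi/4.1.6" boils down to.
    # Real modulefiles are Tcl/Lua and modify the calling shell; this just shows
    # the versioned-prefix layout on the shared software file system.
    import os

    SOFTWARE_ROOT = "/software"                  # hypothetical shared FS path

    def load(package, version):
        prefix = os.path.join(SOFTWARE_ROOT, package, version)
        os.environ["PATH"] = os.pathsep.join(
            [os.path.join(prefix, "bin"), os.environ.get("PATH", "")])
        os.environ["LD_LIBRARY_PATH"] = os.pathsep.join(
            [os.path.join(prefix, "lib"), os.environ.get("LD_LIBRARY_PATH", "")])

    load("openmpi", "4.1.6")   # processes spawned from here now see that version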
As said, this is just a sketch of how to approach this. There are many details on how to implement all of the above in a sane and scalable fashion. Most HPC clusters grow in terms of performance and storage requirements; plan accordingly.
The important thing is to end up with a balanced design and a cluster architecture that matches your workload; i.e. you'd build a radically different design for data-heavy ("big data") workloads where users maybe run GNU R scripts than for one where users solve differential equations on geometry.