
I work in a research group doing a lot of Machine Learning and Computational Biology.

We currently have a cluster, but it is poorly maintained, suffers from low I/O throughput, and, most critically, has no setup for scheduling or load balancing. To use it, you therefore have to find a free node yourself, ssh into that node, run your script on the command line, and manually collect your results.

What is the best software stack for implementing an easy-to-use scheduler and load balancer, so that users can submit their jobs to a central queue, have them run automatically when resources become available, and easily get their results back?

oceanhug
    If you're at a university in Toronto, you're probably best off sitting down with your local HPC consortium and getting detailed advice tailored to your particular workflow; there's [SciNetHPC](http://www.scinethpc.ca/) at UofT, [SHARCNET](https://www.sharcnet.ca) has offices at York and UOIT, and [HPCVL](http://www.hpcvl.org/) is at Ryerson. You could go visit any of them, but those are the ones "at" each of the schools. – Jonathan Dursi Nov 22 '12 at 17:30

1 Answer


There are a number of scheduler/resource-manager options that are open source and well regarded:

  • Torque/Maui, descendants of the venerable PBS, now maintained by Adaptive Computing
  • Slurm, a newer project out of LLNL, which has the advantage that it scales very well (see the job-script sketch after this list)
  • Open Grid Engine, née Sun Grid Engine
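
To give a flavour of the workflow any of these enables, here's a minimal sketch of a Slurm batch script. The job name, resource requests, time limit, and script path are all placeholders you'd adapt to your own cluster, not settings from your setup:

```bash
#!/bin/bash
# Minimal Slurm batch script (sketch; all names and limits are placeholders).
#SBATCH --job-name=ml-train        # how the job appears in the queue
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # cores for your script
#SBATCH --mem=16G                  # memory for the whole job
#SBATCH --time=04:00:00            # wall-clock limit
#SBATCH --output=job-%j.out        # stdout/stderr collected per job id

# The scheduler picks a free node and starts this when resources are available;
# no manual ssh'ing around to find an idle machine.
python train_model.py
```

Users submit with `sbatch train.sh`, watch the queue with `squeue`, and pick up their output from the file named in `--output` when the job finishes. Torque/Maui and Grid Engine have direct analogues (`qsub` to submit, `qstat` to watch the queue), so the user-facing workflow is much the same whichever you pick.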

But there are also a number of complete software stacks that aim to make managing a cluster easier.

I'm making this a community wiki for others who have suggestions.

Jonathan Dursi