
I am looking for a program to perform distributed computing (no parallel computing needed though) which has:

  • a scheduler
  • queue management (FIFO, or preferably something more advanced)
  • good statistics reporting
  • the ability to run on a heterogeneous cluster (a set of machines with different characteristics such as CPU and memory)
  • and, very importantly, good responsiveness: a few seconds at most between triggering a task and the actual start of its execution. I have heard that this may be tricky to achieve with HTCondor and TORQUE. What about Apache Mesos?
RockScience
  • Have a look at [Apache Mesos](http://mesos.apache.org/) – Dima Chubarov Jan 15 '16 at 07:54
  • @DmitriChubarov This would add another level of abstraction. Would it not slow down the response time? – RockScience Jan 15 '16 at 08:19
  • Apache Mesos provides [a list of frameworks](http://mesos.apache.org/documentation/latest/frameworks/) that might well suit your needs with an added advantage of sharing resources between multiple frameworks. Containers are not a requirement. – Dima Chubarov Jan 15 '16 at 08:53
  • Can you expand on what you mean by "a good statistics report"? I know that Torque can handle these other requirements easily. I'm not very familiar with HTCondor but I suspect it can as well, although Torque has a much larger user community. – dbeer Jan 18 '16 at 18:25
  • In case you find it useful, have you checked the Spring XD project? http://projects.spring.io/spring-xd/ – Sergey Shcherbakov Feb 09 '16 at 19:20
  • Mesos can handle all of the above and will scale with you if you need to scale up or have different workloads later on. If your use case is just one single workload, Torque, Slurm or Nomad might be the simpler answer. – Walid Mar 26 '16 at 17:22

1 Answer


There is a fairly large Wikipedia page with comparisons, but you will hardly find big differences there. My guess is that most things can theoretically be done in any of these frameworks. The items you list all depend on perspective (for example, people commonly write their own sophisticated statistics from HTCondor logs). Regarding responsiveness: HTCondor works fine for scheduling interactive notebooks, provided there are enough resources for the workers to pick up the job. A few seconds is often no problem, but there are hardly any guarantees. These are high-throughput systems, not low-latency systems. If you care about latency, you should preallocate workers and scale them up and down; here, support for other frameworks running on top helps much more than native latency.
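To illustrate the point about preallocating workers, here is a minimal sketch in plain Python. It does not use HTCondor, TORQUE, or Mesos; the threads merely stand in for worker slots that are already running before any task arrives, and the names `worker`, `run_with_preallocated_workers`, and `square` are made up for this example. It also records the delay between submitting a task and starting it, which is the kind of home-grown statistic mentioned above.

```python
import queue
import statistics
import threading
import time


def worker(tasks, results):
    """Long-lived worker: blocks on the queue, so it picks up work immediately."""
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut this worker down
            break
        task_id, submitted_at, func, arg = item
        started_at = time.perf_counter()
        value = func(arg)
        # Record how long the task waited between submission and start.
        results.put((task_id, started_at - submitted_at, value))


def run_with_preallocated_workers(func, args, n_workers=4):
    """Start the workers first, then submit tasks and record dispatch latency."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()                 # workers are idle and waiting before any task arrives

    for task_id, arg in enumerate(args):
        tasks.put((task_id, time.perf_counter(), func, arg))
    for _ in workers:
        tasks.put(None)           # one sentinel per worker, queued after all tasks
    for w in workers:
        w.join()

    records = [results.get() for _ in range(len(args))]
    return sorted(records)        # ordered by task_id


def square(x):
    return x * x


records = run_with_preallocated_workers(square, [1, 2, 3, 4, 5])
latencies = [latency for _, latency, _ in records]
values = [value for _, _, value in records]
print("values:", values)
print("mean dispatch latency (s):", statistics.mean(latencies))
print("max dispatch latency (s):", max(latencies))
```

On a real cluster the equivalent would be keeping a pool of already-started worker jobs alive and feeding them tasks through a framework on top of the batch system, rather than submitting one batch job per task. The measured latency here includes any time a task spends waiting for a free worker.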

I will try my best to highlight the main focus of each project from my perspective, concentrating on what matters for a practical decision:

Target audience

Mesos:

vs.

Both HTCondor & Torque:

  • fair-share batch processing particularly in scientific clusters (High Throughput Computing)

Eco-system

Mesos:

vs.

HTCondor:

vs.

TORQUE:

Ease of use

(this is partly about statistics, but more about dashboard-style reporting)

Mesos & TORQUE:

  • Web UI
  • integrations with other frameworks are commonly available (for TORQUE, look for PBS)

HTCondor:

  • new, still-developing REST and Python interfaces, but no common GUI
  • lagging a tiny bit behind in framework support (R batchtools; lately it has had Dask support)
Community
till