
We are a small team in a Microsoft-only corporate environment. Our core task involves running sets of about 100 independent runs of an in-house tool. Each run has one input file and multiple output files, and the single-job run time is around one hour (the jobs are single-threaded but heavily optimized: the single-job run time will not be reduced further).
We are looking for a way to distribute these runs to available CPU cores to bring the wall time for the full set of runs down towards an hour (i.e., the single-job run time).

Our perfect setup would probably be something along these lines:

  • A simple-to-install worker client (if any), to make it easy for users to let their own workstations join the queue.
  • Workers can join the pool dynamically (with a specified number of cores).
  • Real-time job queue manipulation, including adding and cancelling jobs.

There are plenty of job scheduling systems, but most appear to be way more complex than what we need (job dependencies, repeated jobs, ...). This might not be a problem -- but through all that complexity it is hard to figure out which systems meet our demands. Do you have experience with existing systems that meet our demands?

I have also considered a simple worker daemon watching for job files on a network drive. Do you have experience with this approach?
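
Something along these lines is what I have in mind -- only a rough sketch, where the share paths, the *.job naming convention and the tool name (run_tool.exe) are all made up for illustration:

    # Rough sketch of a worker daemon polling a shared folder for job files.
    # All paths, the *.job convention and run_tool.exe are illustrative only.
    import os
    import shutil
    import socket
    import subprocess
    import time
    from pathlib import Path

    QUEUE_DIR = Path(r"\\fileserver\jobs\queue")      # submitted, unclaimed jobs
    RUNNING_DIR = Path(r"\\fileserver\jobs\running")  # jobs claimed by a worker
    DONE_DIR = Path(r"\\fileserver\jobs\done")        # output files end up here
    TOOL = r"C:\tools\run_tool.exe"                   # stand-in for the in-house tool


    def claim_next_job():
        """Claim one queued job by renaming its file into RUNNING_DIR.

        The rename either succeeds for exactly one worker or raises OSError,
        so no central queue server is needed beyond the file share."""
        for job in sorted(QUEUE_DIR.glob("*.job")):
            claimed = RUNNING_DIR / f"{socket.gethostname()}_{job.name}"
            try:
                job.rename(claimed)
                return claimed
            except OSError:
                continue  # another worker was faster; try the next file
        return None


    def run_job(job):
        """Run the tool on the claimed input file and copy the outputs back.

        Assumes the tool writes its output files into the current directory."""
        workdir = Path(os.environ["TEMP"]) / job.stem
        workdir.mkdir(parents=True, exist_ok=True)
        subprocess.run([TOOL, str(job)], cwd=workdir, check=True)
        shutil.copytree(workdir, DONE_DIR / job.stem, dirs_exist_ok=True)
        job.unlink()  # clear the claim marker once the outputs are safe


    if __name__ == "__main__":
        while True:
            job = claim_next_job()
            if job is None:
                time.sleep(30)  # nothing queued; poll again in a while
            else:
                run_job(job)

A workstation contributing N cores would simply run N copies of this daemon; the rename-based claim keeps the workers from stepping on each other.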

Janus

1 Answer


I'm not sure how this could be answered without knowing what your in-house tool is doing. What is making it slow? Where's the bottleneck? How independent are the job runs, and how independent is each piece of a job, so that it could be broken up? Is the application already multithreaded, or does it support multithreading to take advantage of the cores on the system?

You might want to profile your applications to see where the bottleneck(s) is/are, then focus on refactoring them. A small change in the application doing the job could yield big results.

Without some understanding of what the job is and where the slowdown is, it's difficult to tell how to split up your batch jobs. Without knowing how your job process can be broken apart, you might have to determine what the biggest bottleneck is and throw more hardware at it (faster disk subsystem, more memory, faster processors...).

EDIT - If these are totally independent jobs, it might be worth checking whether your bottleneck is simply that they run serially. You could run them on virtual servers, on something like Amazon "cloud" instances, or on a farm of cheap systems that run the jobs and submit the results back to your main program. From the description alone, I'm just not sure whether you should look at what it would take to build this support into your in-house application rather than trying to use some kind of external job scheduler.

Bart Silverstrim
  • Yes, the jobs are totally independent -- but also atomic: these are heavy structural optimization calculations, and the single-job run time is about as low as we expect to get it (for current CPU frequencies, at least). With respect to the farm of cheap systems: this is exactly what we are aiming for -- but I figured the problem is so generic that a good scheduler for this special case might already exist. – Janus May 22 '12 at 14:07
  • Edited the question to (hopefully) clarify this. Also: our goal is very modest: to take the turnaround time down from days (jobs run serially on one CPU) to hours (jobs run in parallel on multiple machines). I agree that it should not be complicated to set up our own system for this -- but I would rather just use an existing system. – Janus May 22 '12 at 14:13
  • Do they actually max out the CPU, or is it because they're being run in serial? And why wouldn't they already be getting doled out to the available cores by the scheduler? I guess I'm a little confused why, if you have access to the source (and I'd assume the programmers), the initial process can't run them on the available cores. If the limitation is the serialization, you could just try it on a hefty server with VMs to create your own virtual farm. – Bart Silverstrim May 22 '12 at 14:14
  • Yes, the CPU is the limiting factor in the sense that each job maxes a single core. In our current setup, we simply run the jobs 12 at a time on a 12-core CPU and let the process scheduler take care of things. What I am looking for is a simple way to bring more servers into the mix. – Janus May 22 '12 at 21:02
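
For reference, the submitting side of the drop-folder sketch above could be just as small -- again only a sketch, reusing the same made-up share path and *.job convention:

    # Sketch of the submitting side for the drop-folder approach sketched above.
    # The share path, the *.job convention and the input folder are made-up examples.
    import shutil
    from pathlib import Path

    QUEUE_DIR = Path(r"\\fileserver\jobs\queue")


    def submit(input_file):
        """Queue one run by copying its input file into the shared folder."""
        input_file = Path(input_file)
        shutil.copy2(input_file, QUEUE_DIR / (input_file.stem + ".job"))


    def cancel(job_name):
        """Cancel a queued (not yet claimed) run by removing its job file."""
        target = QUEUE_DIR / (job_name + ".job")
        if target.exists():
            target.unlink()


    if __name__ == "__main__":
        # Queue a whole set of ~100 runs in one go.
        for f in Path(r"C:\runs\set_2012_05").glob("*.inp"):
            submit(f)

Adding jobs is just dropping new files into the queue folder, and cancelling a job that has not yet been claimed is just deleting its file, which covers the real-time queue manipulation asked for in the question.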