We are a small team in a Microsoft-only corporate environment. Our core task involves running sets of about 100 independent runs of an in-house tool. Each run has one input file and multiple output files, and a single run takes around one hour (the jobs are single-threaded but heavily optimized, so the per-job run time cannot be reduced further).
We are looking for a way to distribute these runs across available CPU cores to bring the wall-clock time for the full set down to about an hour (i.e., the run time of a single job).
Our perfect setup would probably be something along these lines:
- A simple-to-install worker client (if one is needed at all), so users can easily let their own workstations join the queue.
- Workers can join the pool dynamically, each contributing a specified number of cores.
- Real-time job-queue manipulation, including adding and cancelling jobs.
There are plenty of job-scheduling systems, but most appear to be far more complex than what we need (job dependencies, recurring jobs, ...). That might not be a problem in itself, but all that complexity makes it hard to figure out which systems meet our needs. Do you have experience with existing systems that do?
I have also considered a simple worker daemon that watches for job files on a network drive (a rough sketch of what I have in mind is below). Do you have experience with this approach?
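To make the idea concrete, here is a minimal Python sketch of such a daemon, under some assumptions I should be explicit about: all paths, the `.job` extension, and the tool command line (`TOOL_EXE <input-file>`) are made-up placeholders for illustration, and the claim-by-rename trick relies on renames being atomic on the file share (which SMB provides):

```python
# Minimal sketch of the network-drive approach, assuming:
#   - the pending/running/done directories already exist on the share,
#   - the tool takes its input file as the only argument (made-up CLI),
#   - renames on the share are atomic (SMB gives us this).
import os
import subprocess
import sys
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

QUEUE_DIR   = Path(r"\\fileserver\jobs\pending")  # users drop job files here
RUNNING_DIR = Path(r"\\fileserver\jobs\running")  # claimed jobs move here
DONE_DIR    = Path(r"\\fileserver\jobs\done")
TOOL_EXE    = r"C:\tools\inhouse-tool.exe"        # placeholder for our tool
CORES = int(sys.argv[1]) if len(sys.argv) > 1 else os.cpu_count()

slots = threading.Semaphore(CORES)  # one permit per core this worker offers

def claim(job: Path) -> Path | None:
    """Claim a job by renaming it into RUNNING_DIR. The rename succeeds
    for exactly one worker, so a race cannot run the same job twice."""
    try:
        return job.rename(RUNNING_DIR / job.name)
    except OSError:
        return None  # another worker got there first

def run_job(job: Path) -> None:
    try:
        # The tool is single-threaded, so each subprocess occupies one core.
        subprocess.run([TOOL_EXE, str(job)], check=False)
        job.rename(DONE_DIR / job.name)
    finally:
        slots.release()

def main() -> None:
    with ThreadPoolExecutor(max_workers=CORES) as pool:
        while True:  # no clean shutdown in this sketch
            for job in sorted(QUEUE_DIR.glob("*.job")):
                if not slots.acquire(blocking=False):
                    break  # all local cores busy; leave jobs for others
                claimed = claim(job)
                if claimed is None:
                    slots.release()  # lost the race; free the slot again
                    continue
                pool.submit(run_job, claimed)
            time.sleep(5)  # polling is cheap next to ~1 h job run times

if __name__ == "__main__":
    main()
```

With something like this, submitting a job is just copying a job file into the pending directory, and cancelling a queued job is deleting its file. Cancelling a running job is the part this naive approach does not handle, which is one reason I am asking about existing systems.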