4

I have a large set of jobs to run (thousands), each one taking between 30 minutes and a few hours on a single CPU. The memory requirements are small (a few KB each). I'm working on a small Linux cluster that has a few dozen CPUs. So far, I've been starting them a few at a time, trying to manually keep the cluster busy.

My question is: what happens if I submit hundreds or thousands at once -- far more than the number of CPUs? It's clear that each job will take longer to run individually, but I am wondering about the overall efficiency of this method vs. having exactly one job per CPU at a time. I could also write a more complicated method to monitor the progress and keep each CPU occupied with exactly one job (e.g. using multiprocessing in Python), but this would take up costly programmer time, and I'm wondering whether the end result would really be any faster.

user2509951
  • 195
  • 1
  • 6

3 Answers

4

Like so many things, it depends.

If your jobs involve I/O or remote processing, such as file work, database access, web services or other remote calls, then there is often plenty of free CPU time while waiting for these to finish. In these cases there is often a benefit to having more jobs than CPUs. There is obviously some limit to this, but working out and tuning the exact threshold would come under your "costly programmer time".

CPU-bound processes, on the other hand, will most likely clog up the machine as you add more of them.

Again for CPU-bound work, rather than the "push" method you describe, flip it on its head. Have a queuing mechanism where the worker threads/processes (one per CPU) pull work from a master queue. The master queue is lightweight, goes to sleep when it's not being asked for anything, and the workers just chew through the work.
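Since you mentioned Python, here's a minimal sketch of that pull model using `multiprocessing`. Everything here (`run_job`, `run_all`, the squaring placeholder) is illustrative, not from your actual setup, and on platforms that spawn rather than fork (Windows/macOS) you'd call `run_all` under an `if __name__ == "__main__":` guard.

```python
import multiprocessing as mp

def run_job(job_id):
    # Placeholder for one unit of work; your real jobs run 30 minutes to hours.
    return job_id * job_id

def worker(task_queue, result_queue):
    # Each worker pulls jobs until it sees the None sentinel, so no CPU
    # sits idle while work remains in the queue.
    while True:
        job = task_queue.get()
        if job is None:
            break
        result_queue.put((job, run_job(job)))

def run_all(jobs, n_workers):
    task_queue = mp.Queue()
    result_queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(task_queue, result_queue))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for job in jobs:
        task_queue.put(job)
    for _ in procs:
        task_queue.put(None)  # one sentinel per worker
    # Drain results before joining so workers never block on a full pipe.
    out = dict(result_queue.get() for _ in jobs)
    for p in procs:
        p.join()
    return out
```

With `n_workers` set to the CPU count, you get exactly one running job per CPU and the queue handles the scheduling for you.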

All said and done, though, it's really hard to give you a definitive answer without knowing the problem in more detail.

Good luck though!

LoztInSpace
  • 5,584
  • 1
  • 15
  • 27
  • Thanks for the insightful answer. My processes are probably CPU-bound, so I'll be taking a look at some of the libraries Ike suggested. – user2509951 May 19 '15 at 23:12
3

Speed-wise, you're unlikely to get a performance boost spawning more threads than there are hardware threads available, unless your threads spend a lot of time sleeping (in which case sleeping gives your other threads an opportunity to execute). Note that thread sleeps can be implicit and hidden, e.g., in I/O-bound processes and when contending for a lock.

It really depends on whether your threads are spending most of their time waiting for something (ex: more data to come from a server, for users to do something, for a file to update, for access to a locked resource) or just going as fast as they can in parallel. In the latter case, using more threads than are physically available will tend to slow you down. The only way having more threads than tasks can ever help throughput is when those threads spend time sleeping, yielding opportunities for other threads to do more while they sleep.

However, it might make things easier for you to just spawn all these tasks and let the operating system deal with the scheduling.

With vastly more threads, you could potentially slow things down (even in terms of throughput). It depends somewhat on how your scheduling and thread pools work and on whether those threads spend time sleeping, but a thread is not necessarily a cheap thing to construct, and context switching among that many threads can become more expensive than your own scheduling logic, which can know far more about exactly what you want to do and when it's appropriate than the operating system, which just sees a boatload of threads that need to be executed.

There's a reason why efficient libraries like Intel's Threading Building Blocks match the number of threads in the pool to the physical hardware (no more, no less). It tends to be the most efficient route, but it's the most awkward to implement given the need for manual scheduling, work stealing, etc. So sometimes it can be convenient to just spawn a boatload of threads at once, but you typically don't do that as an optimization unless you're I/O-bound, as pointed out in the other answer, and your threads are spending most of their time sleeping and waiting for input.

If you have needs like this, the easiest way to get the most out of them is to find a good parallel processing library (ex: PPL, TBB, OpenMP, etc). Then you just write a parallel loop and let the library figure out how to most efficiently deal with the threads and balance the load between them. In those cases, you focus on what the tasks should do, but not necessarily on when they execute.
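In Python, `multiprocessing.Pool` plays that library role. A minimal sketch, where `run_job` is a hypothetical stand-in for launching one of your actual jobs (in practice it might call `subprocess.run()` on the legacy F90 executable):

```python
from multiprocessing import Pool

def run_job(job_id):
    # Hypothetical placeholder for one long-running job.
    return 2 * job_id

def run_all(job_ids, n_workers=None):
    # Pool defaults to os.cpu_count() workers when n_workers is None,
    # i.e., one worker process per CPU. chunksize=1 suits long-running
    # jobs: each job is big enough to amortize the hand-off overhead,
    # and small chunks keep the load balanced.
    with Pool(processes=n_workers) as pool:
        return pool.map(run_job, job_ids, chunksize=1)
```

The pool hands a new job to each worker as soon as it finishes its previous one, which is exactly the "keep each CPU occupied" behavior you were considering writing by hand.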

  • Thanks very much! This is great detail. I'll have a look at some libraries. My underlying code is F90 (legacy), but I submit the jobs in python; so probably the easiest place to start is with python's multiprocessing library? – user2509951 May 19 '15 at 23:17
  • Ah yes, there `map` and `map_async` are your friend. They accept a function and iterable, so you can just specify a big range of elements and a function to call for each element and leave it up to the library to efficiently allocate and schedule threads from the pool to tackle that range. –  May 20 '15 at 03:36
  • It's worth noting the `chunk_size` parameter. You might actually be able to get speedups using a value greater than 1 if each task only does a light amount of processing. What it does is allow the library to allocate one thread to handle more than one task, which can help because multithreading comes with a bit of overhead over a simple for loop, so you want each iteration to do a sufficient amount of work. There you can just specify the entire range to process in parallel -- this is different from creating an individual thread for the entire range, because the library will smartly... –  May 20 '15 at 03:39
  • ... allocate and assign an optimal number of threads to tackle portions of the range you specify at once and then make the same threads handle other portions of the range when they're done. –  May 20 '15 at 03:40
  • Good answer. One small remark: "There's a reason why efficient libraries like Intel's Thread Building Blocks matches the number of threads in the pool to the physical hardware (no more, no less)." -- Hardware threads isn't an established term. Most libraries that I know use 4x #processor cores. If you would look at SMT (hyperthreading et al) you would end up with 2x #cores on Intel, which isn't the case. – atlaste May 20 '15 at 05:59
3

If you use threads, it's generally a better idea to use thread pooling. If you don't, your CPU will be clogged with context switching. That said, kernels obviously use tricks to ensure that this isn't really a problem in all cases.

My experience with (small) processes that combined use a ton of CPU power is that it's best to limit the number of threads to, say, 4 × the processor count. There's usually some startup period, etc., which is why the 4× is there.

If you use async-stuff, it will probably automatically use tricks like polling and thread pooling, which means it will work just fine. My experience here is that async stuff usually works better than threading for IO.
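For illustration of why async works well for I/O: a single thread can multiplex thousands of waits (via epoll and similar under the hood). A hedged sketch using Python's `asyncio`, where `fetch` is a made-up stand-in for a network call:

```python
import asyncio

async def fetch(i):
    # Hypothetical stand-in for an I/O wait (socket read, HTTP call, ...).
    await asyncio.sleep(0.01)
    return i

async def main(n):
    # All n waits overlap on one thread, so the total wall time is
    # roughly one sleep, not n sleeps -- no thread pool needed.
    return await asyncio.gather(*(fetch(i) for i in range(n)))

results = asyncio.run(main(100))
```

Note this only helps when the work is waiting; for CPU-bound jobs like yours, it buys nothing over one process per CPU.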

atlaste
  • 30,418
  • 3
  • 57
  • 87
  • Thanks! This is a good rule of thumb. Can you expand more on async-stuff? – user2509951 May 19 '15 at 23:18
  • Async libraries usually use different techniques to keep the number of threads in check. For example, instead of running 1 thread per (network) socket, you can also use things like epoll (http://en.wikipedia.org/wiki/Epoll). Basically the rule of thumb here is: if async is available in your library and you need scalability, it's probably the best way to go. – atlaste May 20 '15 at 06:01