
I need to limit the number of processes being executed in parallel. For instance, I'd like to execute this pseudo-command line:

export POOL_PARALLELISM=4
for i in `seq 100` ; do
    pool foo -bar &
done

pool foo -bar # would not complete until the first 100 finished.

Therefore, despite 101 foos being queued up to run, only 4 would be running at any given time; pool would fork()/exit() and queue the remaining processes until they complete.

Is there a simple mechanism to do this with Unix tools? at and batch don't apply because they generally fire at the top of the minute and execute jobs sequentially. A queue is not necessarily the best fit because I want these to run synchronously.

Before I write a C wrapper employing semaphores and shared memory and then debug the deadlocks that I'll surely introduce, can anyone recommend a bash/shell mechanism or other existing tool to accomplish this?

Jé Queue
  • What are you trying to accomplish overall? The mechanism you state here sounds like it might not be the correct solution for your likely problem. – Perry Feb 27 '12 at 23:03
  • There is a certain TCP-connection-hungry program: when it runs, it makes hundreds of outbound connections. It is triggered by an HTTP CGI request. Normally not an issue, but under a medium web spike, if I could put a bound on the number of these that get invoked at any given time (even if they block for a short while), that would help manage said connections (the network admins hate this app). – Jé Queue Feb 27 '12 at 23:16
  • You might be better off turning said program into an event driven single server with a job queue -- it would radically lower your overall load. Google for "The C10K Problem" for details on why this is a common sort of issue to face. – Perry Feb 27 '12 at 23:30
  • Are you saying that for *any given* CGI request, you want a maximum of four instances of `foo -bar` running simultaneously? Or are you saying that in total, across *all* CGI requests, you want a maximum of four instances of `foo -bar` running simultaneously? – ruakh Feb 27 '12 at 23:41
  • @ruakh, not for EVERY CGI request, just this one particular CGI program (it creates a chart from lots of data sources). It's low volume, but when the workforce comes in the morning they all hit at the same time and many thousands of connections originate. I can't throttle at the TCP or HTTP level as that would affect the rest of the site. So I'm thinking I can wrap this CGI program and bound how many concurrent requests happen across all requests for this CGI program (your latter scenario). – Jé Queue Feb 28 '12 at 00:02

1 Answer


There's definitely no need to write this tool yourself; there are several good choices.

make

make can do this pretty easily, but it relies extensively on files to drive the process. (If you want to run some operation on every input file that produces an output file, this might be awesome.) The -j command line option will run the specified number of tasks at once, and the -l load-average option tells make not to start new tasks unless the system load average is below the given value. (That might be nice if you wanted to do some work "in the background". Don't forget about the nice(1) command, which can also help here.)

So, a quick (and untested) Makefile for converting images:

ALL=$(patsubst cimg%.jpg,thumb_cimg%.jpg,$(wildcard cimg*.jpg))

.PHONY: all
all: $(ALL)

thumb_cimg%.jpg: cimg%.jpg
        convert $< -resize 100x100 $@

If you run this with make, it'll convert the images one at a time. If you run it with make -j8, it'll run eight jobs at once. If you run make -j with no number, it'll start hundreds. (When compiling source code, I find that twice the number of cores is an excellent starting point; that gives each processor something to do while waiting on disk IO. Different machines and different loads will behave differently.)
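
For the original question's limit of four concurrent jobs, the invocation might look something like this (a sketch; the 3.5 load-average cutoff is just an arbitrary example):

# run at most 4 jobs at once, avoid starting new ones while the load average
# is above 3.5, and be polite about CPU priority
nice make -j4 -l 3.5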

xargs

xargs provides the --max-procs command line option. This is best if the parallel processes can be driven from a single input stream of either ASCII NUL-separated or newline-separated arguments. (Well, the -d option lets you pick something else, but these two are common and easy.) This gives you the benefit of using find(1)'s powerful file-selection syntax rather than writing funny expressions like the Makefile example above, or lets your input be completely unrelated to files. (Consider a program for factoring large composite numbers into primes -- making that task fit into make would be awkward at best; xargs could do it easily.)

The earlier example might look something like this:

find . -maxdepth 1 -name '*.jpg' -printf '%f\0' | xargs -0 --max-procs 16 -I {} convert {} -resize 100x100 thumb_{}
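
And the factoring case mentioned above could be as simple as this (a sketch, assuming a hypothetical numbers.txt with one composite number per line, and coreutils' factor(1)):

# factor up to 4 numbers at a time, one number per factor invocation
xargs --max-procs 4 -n 1 factor < numbers.txt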

parallel

The moreutils package (available at least on Ubuntu) provides the parallel command. It can run in two different ways: either running a specified command on different arguments, or running different commands in parallel. The previous example could look like this:

parallel -i -j 16 convert {} -resize 100x100 thumb_{} -- *.jpg
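
Its other mode, running several unrelated commands side by side, looks roughly like this (a sketch; the three commands are only placeholders):

# run ls, df, and 'echo hi' concurrently, at most two at a time
parallel -j 2 -- ls df 'echo hi'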

beanstalkd

The beanstalkd program takes a completely different approach: it provides a message bus for you to submit requests to, and job servers block on jobs being entered, execute the jobs, and then return to waiting for a new job on the queue. If you want to write data back to the specific HTTP request that initiated the job, this might not be very convenient, as you have to provide that mechanism yourself (perhaps a different 'tube' on the beanstalkd server), but if the end result is submitting data into a database, or email, or something similarly asynchronous, this might be the easiest to integrate into your existing application.
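
Getting the daemon itself running is the easy part (a sketch; -l and -p are beanstalkd's listen-address and port options, and the producer and worker sides would talk to it through one of the beanstalkd client libraries rather than the shell):

# start the work queue, listening on localhost only
beanstalkd -l 127.0.0.1 -p 11300 &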

sarnold
  • All these seem like bad choices given the problem domain the poster explained above. Indeed, I think none of these are terribly similar to the sort of thing the poster wants -- make and xargs are especially unsuitable. – Perry Feb 27 '12 at 23:34
  • @Perry: agreed. I think they were fair fits when the task looked shell-script-oriented. I've added a new mechanism, `beanstalkd`, that provides for a work-queue approach -- it ought to integrate into most "web apps" far easier. – sarnold Feb 27 '12 at 23:45
  • @sarnold - I've not used `beanstalkd`, but this is the right approach minus the asynchrony, I need this synchronous if not delayed (but I'll look through config docs). If `parallel` could coordinate across invocations, that would work, but I don't think it does? – Jé Queue Feb 28 '12 at 00:05
  • The top three mechanisms I suggested, `make`, `xargs`, and `parallel`, all assume that the _one_ invocation is the complete and total source of job requests. I don't think they translate well to CGI-land. With `beanstalkd`, you could have your CGI add new jobs, then sit around waiting for a response to be sent back (as a new "job request") on the beanstalk, but it would be up to you to write it... – sarnold Feb 28 '12 at 00:28