
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.

Each of these 'jobs' is stored in a MySQL DB with a timestamp for when the job should run (which may be months in the future). Jobs can be cancelled at any time by the user.

The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
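To illustrate the kind of pre-insert check involved, here is a minimal sketch using PHP's imap extension (the server, credentials, and sender address are placeholders, and the actual condition logic is app-specific):

```php
<?php
// Hypothetical reply check before inserting a scheduled message.
// Requires the PHP imap extension; all connection details are made up.
$mbox = imap_open('{imap.example.com:993/imap/ssl}INBOX',
                  'user@example.com', 'secret');
if ($mbox === false) {
    die('IMAP connect failed: ' . imap_last_error());
}

// Has the contact we are waiting on already replied?
$replies = imap_search($mbox, 'FROM "contact@example.com"');

if ($replies === false) {
    // No reply found - safe to insert our scheduled message
    // (imap_append() would go here).
}

imap_close($mbox);
```
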

We currently have a system where a cron script runs every minute or so and gets all the jobs from the DB that need delivering in the next X minutes. It then splits them into batches of Z jobs and, for each batch, performs an asynchronous POST request back to the same server with the data for those Z jobs (to achieve 'fake' multithreading). The server then processes each batch of Z jobs as it arrives via HTTP.
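For context, the dispatcher described above might look something like the following sketch (table, column, and URL names are all made up; the fan-out uses PHP's curl_multi to fire the batch POSTs concurrently):

```php
<?php
// Sketch of the cron dispatcher: fetch due jobs, chunk into batches of $z,
// POST each batch to a worker endpoint without waiting for each in turn.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$z   = 50; // jobs per batch - currently an arbitrary number

$stmt = $pdo->query(
    'SELECT * FROM jobs
     WHERE run_at <= DATE_ADD(NOW(), INTERVAL 5 MINUTE) AND done = 0'
);
$jobs = $stmt->fetchAll(PDO::FETCH_ASSOC);

$mh      = curl_multi_init();
$handles = array();
foreach (array_chunk($jobs, $z) as $batch) {
    $ch = curl_init('https://worker.example.com/process-batch.php');
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => array('jobs' => json_encode($batch)),
        CURLOPT_RETURNTRANSFER => true,
    ));
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all batch requests concurrently.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $ch) {
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```
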

The reason we use an async HTTP POST for multithreading, rather than something like pcntl_fork, is so that we can add other servers, POST the data to those instead, and have them run the jobs rather than the current server.

So my question is - is there a better way to do this?

I appreciate that work queues like beanstalkd are available to use, but do they fit a model where jobs have to run at specific times?

Also, because we need to keep the jobs in the DB anyway (we need to provide users with a UI for managing them), would adding a work queue somewhere in there actually add more overhead rather than reduce it?

I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!

We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.

Mark Perkins
  • Better in what way? You could specify what exactly you don't like in your current setup. – Ranty Jan 16 '12 at 10:29
  • It sounds like you are doing things in a relatively sensible way. Often for greatest scalability a service-oriented architecture (SOA) will be the best solution, which is what it sounds like you're doing with posting over HTTP to other servers. See: http://en.wikipedia.org/wiki/Service-oriented_architecture – Paul Bain Jan 16 '12 at 10:45
  • The added overhead of the HTTP requests feels wasteful to me, and I was wondering if there was a better way to achieve the same result (i.e. spread the processing load across servers). Also we split the currently pending jobs into batches of Z jobs, but that Z number is fairly arbitrary and not responsive to actual memory use or the time the processes spend running. But really I just want some validation that this is not a crazy way to be doing things and that I haven't missed a much simpler or more efficient way of doing this! :-) – Mark Perkins Jan 16 '12 at 20:55

1 Answer


Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with messages that only become available to be reserved, and run, after X seconds (the time you want the job to run minus the time now).
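A minimal sketch of the put-with-delay step, assuming the Pheanstalk client library (the tube name, payload shape, and `run_at` field are assumptions; constant names may vary between Pheanstalk versions):

```php
<?php
// Sketch: enqueue a due-soon job with put-with-delay.
// Assumes the Pheanstalk client and a beanstalkd daemon on localhost.
require_once 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');

// $job['run_at'] is the unix timestamp the job should run at.
$delay = max(0, $job['run_at'] - time());

// put() returns the beanstalkd job ID, which is worth saving back to
// the DB row (see below re: cancellation).
$beanstalkJobId = $pheanstalk
    ->useTube('imap-jobs')
    ->put(json_encode($job), Pheanstalk::DEFAULT_PRIORITY, $delay);
```
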

The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to reserve. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP). That could barely manage 20 QPS at the very most, but Beanstalkd accepted over a thousand per second with barely any effort.
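A worker under the same assumptions (Pheanstalk client, 'imap-jobs' tube, JSON payload) could be sketched like this, run from the command line in a loop:

```php
<?php
// Sketch of a long-running worker. reserve() blocks until beanstalkd
// releases a job whose delay has expired.
require_once 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');
$pheanstalk->watch('imap-jobs')->ignore('default');

while (true) {
    $job  = $pheanstalk->reserve();          // blocks until a job is ready
    $data = json_decode($job->getData(), true);

    try {
        // ... open the IMAP connection, check for replies,
        //     append the message, mark the DB row done ...
        $pheanstalk->delete($job);           // done - remove from the queue
    } catch (Exception $e) {
        $pheanstalk->bury($job);             // park the job for inspection
    }
}
```
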

Edited to add: You can't delete a job without knowing its ID, though you could store that outside the queue. OTOH, do users have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance; you could still have a single DB-reader that runs every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they bring.
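If you do save the beanstalkd job ID back into the DB row at enqueue time, cancellation of an already-queued job might look like this sketch (the `beanstalk_id` column is an assumption; `peek()` and `delete()` are standard Pheanstalk calls):

```php
<?php
// Sketch: cancel a queued job using the beanstalkd ID stored in the DB.
require_once 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pheanstalk = new Pheanstalk('127.0.0.1');

// $row is the job's DB record; beanstalk_id was saved when it was queued.
if (!empty($row['beanstalk_id'])) {
    try {
        $job = $pheanstalk->peek($row['beanstalk_id']);
        $pheanstalk->delete($job);
    } catch (Exception $e) {
        // Job already reserved or run - nothing left to cancel in the queue.
    }
}
// Either way, mark the DB row cancelled so the loader never re-queues it.
```
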

Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.

If what you are doing is not a problem now, and won't become so with additional load, then carry on.

Alister Bulman
  • Beanstalkd workers would generally run from the command line in a loop, rather than on a HTTP request. I've got a few posts on using it at http://www.phpscaling.com/tag/beanstalkd/ – Alister Bulman Jan 23 '12 at 19:04
  • The problem with using beanstalkd, which makes it unsuitable for our use (as far as I can see) is that we need the users to be able to cancel the jobs once created. Looking at beanstalkd there doesn't seem to be a way to track and cancel/update individual jobs once they are in the queue. Or am I missing something? – Mark Perkins Jan 25 '12 at 09:31