8

I'm wondering if there's a way to set up RabbitMQ or Redis to work with Celery so that when I send a task to the queue, it doesn't go into a list of tasks but rather into a set of tasks keyed on the payload of my task, in order to avoid duplicates.

Here's my setup for more context: Python + Celery. I've tried RabbitMQ as a broker; now I'm using Redis as the broker because I don't need 100% reliability, and Redis is easier to use, has a small memory footprint, etc.

I have roughly 1000 ids that need work done repeatedly. Stage 1 of my data pipeline is triggered by a scheduler and it outputs tasks for stage 2. The tasks contain just the id for which work needs to be done and the actual data is stored in the database. I can run any combination or sequence of stage 1 and stage 2 tasks without harm.

If stage 2 doesn't have enough processing power to deal with the volume of tasks output by stage 1, my task queue grows and grows. This wouldn't have to be the case if the task queue used sets as the underlying data structure instead of lists.

Is there an off-the-shelf solution for using sets instead of lists as distributed task queues? Is Celery capable of this? I recently saw that Redis released an alpha version of a queue system, so that isn't ready for production use just yet.

Should I architect my pipeline differently?

HostedMetrics.com
  • With RabbitMQ you could lazily create a queue for each unique ID with a max queue depth (`x-max-length`) of 1. There's the extra housekeeping of publishing and subscribing to 1000 different queues, but duplicates would be dropped as you require (rough sketch after these comments). – tariksbl May 17 '15 at 14:33
  • This is exactly the kind of workaround type of logic I'm looking for, but this particular solution seems tedious and also I'd prefer to stay away from RabbitMQ going forward. However, thank you for the creativity! – HostedMetrics.com May 24 '15 at 02:11
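
A minimal sketch of the per-id queue idea from the first comment, assuming pika; the `stage2.<id>` queue-naming scheme and the `publish` helper are illustrative, not from the question:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

def publish(item_id):
    queue = 'stage2.%s' % item_id
    # Lazily declare a queue capped at one message. When a second
    # message arrives, RabbitMQ drops the oldest one, so at most one
    # pending task per id ever exists.
    ch.queue_declare(queue=queue, arguments={'x-max-length': 1})
    ch.basic_publish(exchange='', routing_key=queue, body=str(item_id))
```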

3 Answers

2

You can use an external data structure to store and monitor the current state of your Celery queue. Let's take a Redis key-value store as an example:

  1. Whenever you push a task into Celery, you mark a key with your 'id' field as true in Redis.

  2. Before pushing a new task with any 'id', you check whether the key for that 'id' is true in Redis; if it is, you skip pushing the task.

  3. To clear the keys at the proper time, you can use Celery's after_return handler, which runs when the task has returned. This handler unsets the key for 'id' in Redis, clearing the lock for the next task push.

This method ensures you only have ONE instance of a task per id in the Celery queue. You can also enhance it to allow up to N tasks per id by using the INCR and DECR commands on the Redis key when the task is pushed and in after_return.
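
A minimal sketch of this approach, assuming redis-py and Celery; the `lock:<id>` key names and the `do_work` helper are illustrative:

```python
import redis
from celery import Celery, Task

app = Celery('pipeline', broker='redis://localhost:6379/0')
r = redis.Redis()

class DedupTask(Task):
    def after_return(self, status, retval, task_id, args, kwargs, einfo):
        # Runs after the task finishes (success or failure): unset the
        # key so the next task for this id can be enqueued again.
        r.delete('lock:%s' % args[0])

@app.task(base=DedupTask)
def stage_two(item_id):
    do_work(item_id)  # hypothetical stage-2 processing

def enqueue(item_id):
    # SET with nx=True only succeeds if the key does not exist yet,
    # so a duplicate push for the same id is skipped.
    if r.set('lock:%s' % item_id, 1, nx=True):
        stage_two.delay(item_id)
```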

DhruvPathak
1

Can your tasks in stage 2 check whether the work has already been done and, if it has, then not do the work again? That way, even though your task list will grow, the amount of work you need to do won't.
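
A hedged sketch of that check, assuming a hypothetical `processed_at` marker that stage 2 can compare against the row's `updated_at` in the database (all helpers are illustrative):

```python
from celery import Celery

app = Celery('pipeline', broker='redis://localhost:6379/0')

@app.task
def stage_two(item_id):
    record = load_record(item_id)      # hypothetical DB accessor
    if record.processed_at and record.processed_at >= record.updated_at:
        return                         # work already done; exit cheaply
    do_work(item_id)                   # hypothetical stage-2 processing
    mark_processed(item_id)            # hypothetical: stamp processed_at
```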

I haven't come across a solution regarding the sets/lists question, but I'd think there are lots of other ways of getting around this issue.

Maximilian
  • The size of the task queue itself is the problem. Depending on which software you use for queueing, you end up consuming either memory or space in the database. Neither of those is desirable. I'd prefer a well-behaved system. – HostedMetrics.com May 17 '15 at 15:38
  • Right, I see. Presumably even if you did have a solution using sets, you'd be queuing up work that was already complete? I wonder if there's a way of recording "work done or planned to be done" somewhere, and only adding the task to the queue if it's not in that list. – Maximilian May 18 '15 at 16:45
-1

Use a SortedSet within Redis for your job queue. It is indeed a set, so if you put in the exact same data it won't add a new value (it absolutely needs to be the exact same data; you can't override the comparison Redis uses for SortedSet members).

You will need a score to use with the SortedSet; you can use a timestamp (the value as a double, unixtime for instance), which will let you fetch the most recent or oldest items when you want. ZRANGEBYSCORE is probably the command you will be looking for: http://redis.io/commands/zrangebyscore

Moreover, if you need additional behaviours, you can wrap everything inside a Lua script for atomic behaviour and a custom eviction strategy if needed, for instance a "get" script that fetches a job and removes it from the queue atomically, or one that evicts data when there is too much back pressure.
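
A minimal sketch of this, assuming redis-py and an illustrative key name `jobs`; the atomic pop uses a Lua script as the answer suggests (ZPOPMIN did not exist until Redis 5.0):

```python
import time
import redis

r = redis.Redis()

def enqueue(item_id):
    # ZADD with an existing member only refreshes its score, so a
    # duplicate id never creates a second entry in the set.
    r.zadd('jobs', {str(item_id): time.time()})

# Atomically fetch the oldest job and remove it from the set.
POP_OLDEST = """
local item = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', '+inf', 'LIMIT', 0, 1)
if #item == 0 then return nil end
redis.call('ZREM', KEYS[1], item[1])
return item[1]
"""
pop_oldest = r.register_script(POP_OLDEST)

def dequeue():
    return pop_oldest(keys=['jobs'])  # returns the member, or None if empty
```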

zenbeni
  • Downvoted. The OP already knows that sets or sorted sets would solve the problem. What the OP is asking is whether a set/sorted set is easily pluggable into Celery or not. – DhruvPathak May 21 '15 at 10:18