
Our system has jobs that consume input queues containing the ids of items the jobs take as input. There are a few thousand of those input queues, and each queue contains from a few ten thousand up to a few million ids. A job typically takes a batch of ids from one queue (around 20,000) and does its job. On the other hand, I've got some producers that push ids into the queues. These also work in batches, so we are often inserting a few ten thousand up to a few million ids into a queue at the same time.

We did not use messaging systems like RabbitMQ because our producers often push duplicates into the queue, so set semantics are preferable for us. Additionally, our jobs get a notification once something is pushed into the queue, so there's no need to subscribe to it.

Queue content is temporary and data may be lost in case of failure.

Can anyone recommend how best to solve this problem?

We are currently using an RDBMS table where the id is the primary key and a second column identifies the queue. Inserts use the ON DUPLICATE KEY UPDATE syntax, so we can do everything in a single batched statement. The disadvantage is the high IO load. The advantage is that we can easily look into the queue contents and very easily perform manual actions (bulk inserts, deletes, etc...) in case we need to intervene manually.
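For illustration, here's a minimal JDBC sketch of that approach, assuming MySQL; the table and column names (queue_items, id, queue_id) are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class QueueTableSketch {
    // Hypothetical schema: queue_items(id BIGINT PRIMARY KEY, queue_id BIGINT)
    private static final String UPSERT =
        "INSERT INTO queue_items (id, queue_id) VALUES (?, ?) "
      + "ON DUPLICATE KEY UPDATE queue_id = VALUES(queue_id)";

    static void insertBatch(Connection conn, long queueId, List<Long> ids) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT)) {
            for (long id : ids) {
                ps.setLong(1, id);
                ps.setLong(2, queueId);
                ps.addBatch();      // accumulate rows client-side
            }
            ps.executeBatch();      // send the whole batch in one go
        }
    }
}
```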

I'm wondering if Redis could be a choice for us (using sets?). What about memory limits? Does it perform well when it's disk-bound? What happens if we want to "take" (get & remove) items from the set/queue? Does that perform well, or does it put a large load on IO?

Any input, no matter the technology (we are using JVM-based languages) or database, is welcome!

Peter Rietzler

1 Answer


If you store just ids, then Redis and its sets are a perfect tool for the job. Sets handle the uniqueness for you, there's none of that slow SQL part, and SPOP can pop multiple items at once (randomly chosen, though).
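A minimal sketch of that pattern from the JVM, using the Jedis client; the key name, host, and batch size are assumptions:

```java
import redis.clients.jedis.Jedis;

import java.util.Set;

public class RedisQueueSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Producer side: SADD gives set semantics, so pushing
            // a duplicate id is a harmless no-op
            jedis.sadd("queue:items-to-process", "1001", "1002", "1003", "1001");

            // Consumer side: SPOP with a count atomically gets & removes
            // a batch of (randomly chosen) members in a single command
            Set<String> batch = jedis.spop("queue:items-to-process", 20000);
            System.out.println("took " + batch.size() + " ids");
        }
    }
}
```

Because SPOP is a single atomic command, two competing consumers will never receive the same id.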

However, it does not work very well when the amount of data exceeds the available RAM, so you should take that into account (just get enough RAM). On the plus side, no disk I/O on each transaction! :)

There are a few thousand of those input queues, and each queue contains from a few ten thousand up to a few million ids

Depending on the size of the ids, this dataset might be problematic to fit on a single machine. Since you only use one queue at a time (correct?), you can safely deploy Redis Cluster, which will shard the dataset across multiple machines.
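If you go the cluster route, here's a hedged sketch with JedisCluster (the node address is an assumption). Each key maps to one of 16384 hash slots, so a whole set always lives on a single node while different queues spread across the cluster:

```java
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.Collections;

public class ClusterSketch {
    public static void main(String[] args) {
        // One seed node is enough; the client discovers the other nodes
        try (JedisCluster cluster = new JedisCluster(
                Collections.singleton(new HostAndPort("10.0.0.1", 6379)))) {
            // The client routes each queue key to the node owning its slot
            cluster.sadd("queue:42", "1001", "1002");
        }
    }
}
```

If you need to steer which queues land in the same slot, Redis Cluster hash tags (a {...} part in the key) force keys into one slot, and slots can be rebalanced across nodes manually.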

Sergio Tulentsev
  • Does Redis shard fully automatically, or do I have to take care that it won't place too many large queues on a single machine and thus exceed the available RAM? I could actually do that by providing a custom sharding function, since I have a good estimation of each queue's size. – Peter Rietzler Jul 30 '16 at 10:57