Our system has jobs that consume input queues containing the ids of the items each job should process. There are a few thousand of these input queues, and each queue contains from a few tens of thousands up to a few million ids. A job typically takes a batch of ids from one queue (around 20,000) and does its job. On the other side, we have producers that push ids into the queues. These also work in batches, so we often insert a few tens of thousands up to a few million ids into a queue at once.
We did not use a messaging system such as RabbitMQ because our producers often push duplicates into the queue, so set semantics are preferable for us. Additionally, our jobs already get a notification once something is pushed into the queue, so there is no need to subscribe to it.
Queue content is temporary, and losing data in case of a failure is acceptable.
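To make the required semantics concrete, here is a minimal in-memory sketch of a set-backed queue with a batched push and a batched "take". It is only an illustration of the behaviour described above, not an existing implementation: the class and method names (SetQueues, pushBatch, takeBatch) are made up, queues are assumed to be identified by a string, and ids are assumed to be numeric.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of the queues; all names are illustrative only.
class SetQueues {
    // One set of ids per queue; a set silently drops duplicates on insert.
    private final Map<String, Set<Long>> queues = new ConcurrentHashMap<>();

    // Producer side: add a batch of ids and return how many were actually new.
    public int pushBatch(String queue, Collection<Long> ids) {
        Set<Long> set = queues.computeIfAbsent(queue, q -> new LinkedHashSet<>());
        synchronized (set) {
            int before = set.size();
            set.addAll(ids);
            return set.size() - before;
        }
    }

    // Consumer side: get and remove up to maxIds ids in one step.
    public List<Long> takeBatch(String queue, int maxIds) {
        List<Long> batch = new ArrayList<>();
        Set<Long> set = queues.get(queue);
        if (set == null) {
            return batch; // unknown queue: nothing to take
        }
        synchronized (set) {
            Iterator<Long> it = set.iterator();
            while (it.hasNext() && batch.size() < maxIds) {
                batch.add(it.next());
                it.remove();
            }
        }
        return batch;
    }
}
```

A job would call something like takeBatch("queue-42", 20000) after being notified. Everything lives in memory here, which matches the "data may be lost" requirement but ignores the question of whether a few thousand queues with millions of ids each fit in RAM.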
Can anyone recommend how best to solve this problem?
We are currently using an RDBMS table where the id is the primary key and a second column identifies the queue. Inserts use the ON DUPLICATE KEY UPDATE syntax, so we can do everything in a single batched statement. The disadvantage is the high I/O load. The advantage is that we can easily inspect the queue contents and perform manual actions (bulk inserts, deletes, etc.) when we need to intervene manually.
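A minimal sketch of what such a batched statement could look like when built from Java. The table and column names (queue_items, id, queue_id) are placeholders rather than the real schema, the helper class is hypothetical, and what the UPDATE clause assigns on a duplicate is an assumption.

```java
import java.util.Collections;

// Hypothetical helper; queue_items, id and queue_id are assumed names.
class BatchUpsertSql {
    // Builds one multi-row MySQL-style upsert with a (?, ?) placeholder pair per id.
    static String build(int rowCount) {
        if (rowCount < 1) {
            throw new IllegalArgumentException("rowCount must be >= 1");
        }
        String placeholders = String.join(", ", Collections.nCopies(rowCount, "(?, ?)"));
        // Assumption: on a duplicate id the row is simply re-pointed at the given queue.
        return "INSERT INTO queue_items (id, queue_id) VALUES " + placeholders
                + " ON DUPLICATE KEY UPDATE queue_id = VALUES(queue_id)";
    }
}
```

The resulting string would be passed to a JDBC PreparedStatement, binding the id and the queue identifier for each row in order. In practice a batch of millions of ids would likely have to be split into several statements to stay within the server's packet size limit.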
I'm wondering whether Redis could be an option for us (using Sets?). What about memory limits? Does it perform well when it is disk-bound? What happens if we want to "take" (get and remove) items from the Set / queue? Does that perform well, or does it put a heavy load on I/O?
Any input is welcome, regardless of technology or database (we are using JVM-based languages)!