The context
I have an infrastructure where a server produces long-running jobs, each consisting of logical chunks of roughly the same size, but different jobs have vastly different numbers of chunks. I have a scalable number of workers that take the chunks, do the work (processor-heavy), and return the results to the server. A worker works on only one chunk at a time.
Currently I schedule the chunks with an SQS queue: when a job is created I dump all of its chunks onto the queue and the workers take them from there. It effectively works as a FIFO.
So to summarize what does what:
A job is a lot of processor-intensive calculation. It consists of multiple independent chunks that are about the same size.
A chunk is a processor-intensive calculation a worker can work on. It is independent of the other chunks and can be computed on its own, without additional context.
The server creates jobs. When a job is created the server puts all of the job's chunks on the queue (and essentially forgets about the job).
The workers work on chunks. It does not matter which job a chunk belongs to; a worker can take any of them. Whenever a worker has nothing to work on (it is newly created, or it has just finished its previous chunk), it takes the next chunk from the queue.
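For reference, the current setup looks roughly like this (a minimal Python/boto3 sketch; `QUEUE_URL` is a placeholder and `process_chunk` stands in for the real processor-heavy work):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.example.amazonaws.com/123456789012/chunks"  # placeholder

def enqueue_job(job_id, chunks):
    # Server side: dump every chunk of the job onto the queue, then forget it.
    for i, payload in enumerate(chunks):
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"job_id": job_id, "chunk_id": i, "payload": payload}),
        )

def worker_loop():
    # Worker side: take one chunk at a time, process it, then delete it.
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            process_chunk(json.loads(msg["Body"]))  # hypothetical worker function
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```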
The problem
When a job is scheduled, all of its chunks are added to the queue, so a job scheduled after it will not be worked on until the first job is finished. In a scenario where job A (scheduled first) takes 4 hours and job B (scheduled second) takes 5 minutes, job B will not be started for the first few hours and will only finish after about 4 hours and 5 minutes. So if a large job is scheduled, it effectively blocks all other calculations. The queue will look like this:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 ... A100 B1 B2
I would like new calculations not to be blocked, but to be processed in a different order, like:
A1 B1 A2 B2 A3 A4 A5 A6 A7 A8 A9 A10 ... A100
If a third job arrives after A1 and B1 have been picked up, it should still not be blocked:
A2 B2 C1 A3 C2 A4 C3 A5 C4 A6 A7 A8 A9 A10 ... A100
With the chunks ordered like this (effectively a round-robin across jobs; see the sketch after this list) I can guarantee the following:
- For every job the first chunk is picked up relatively fast.
- For every job there is constant perceived progress (new chunks keep getting finished).
- Short jobs (those with few chunks) finish relatively fast.
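The ordering itself is easy to produce when the per-job chunk lists are at hand; it is just a round-robin interleave. A minimal sketch using only the standard library:

```python
from itertools import chain, zip_longest

def interleave(jobs):
    """Round-robin across the jobs' chunk lists.

    interleave([["A1", "A2", "A3", "A4", "A5"], ["B1", "B2"]])
    -> ["A1", "B1", "A2", "B2", "A3", "A4", "A5"]
    """
    filler = object()  # sentinel, so falsy chunk values are not dropped
    mixed = chain.from_iterable(zip_longest(*jobs, fillvalue=filler))
    return [chunk for chunk in mixed if chunk is not filler]
```

The hard part is not computing this order but getting an already-populated queue to follow it.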
Solutions
I know I cannot reorder an SQS queue in place, so I might have to do something like:
- Change technologies: maybe some other queue in AWS supports this out of the box.
- When a new job is about to be scheduled, the server takes all chunks off the queue, shuffles the new chunks in, and puts everything back on the queue (see the sketch after this list).
- Somehow achieve the intended behavior with a priority queue (maybe RabbitMQ).
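For the second option, I imagine something like the sketch below (reusing `interleave()` from above). It is best-effort at most: chunks currently in flight at workers are invisible to the drain, workers keep consuming while it runs, and a standard SQS queue does not strictly guarantee the re-enqueued order anyway:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.example.amazonaws.com/123456789012/chunks"  # placeholder

def reschedule_with_new_job(new_chunks):
    # Drain whatever is currently visible on the queue. Note that an empty
    # receive does not prove the queue is empty, and in-flight messages
    # (already at a worker) are hidden from this loop.
    backlog = []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            backlog.append(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    # Shuffle the new job's chunks into the backlog (round-robin between the
    # two lists) and put everything back on the queue.
    for body in interleave([backlog, new_chunks]):
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body))
```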
Is there an easy, safe solution for this? How should I do it?