The Problem
I need a job queue system that is considerably more complex than a standard FIFO or priority queue. It would mostly behave like a FIFO queue, but with extra logic around dequeuing and running jobs. In my case, a job should only be dequeued if doing so wouldn't exceed certain user and system concurrency limits (for example, a user can only have 10 concurrent jobs of a certain type, the entire system can only have 100 concurrent jobs of a certain type, etc.).
The goal is to have a Kubernetes cluster be the consumer of this queue: if all necessary conditions are met, Kubernetes will dequeue the job and spin up a new container to run it. I'm not a Kubernetes expert, but I don't think we'd be able to have Kubernetes run these concurrency checks before dequeuing a given job. So what I think I need is a job queue system that bakes these checks into the queue itself, only allowing jobs that pass them to be dequeued.
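To make the desired behavior concrete, here is a minimal in-memory sketch of a FIFO queue whose dequeue step skips any job that would break a per-user or system-wide limit. The names (`Job`, `LimitedFifoQueue`, `complete`) are hypothetical and only illustrate the semantics; the limits of 10 and 100 are the example figures above, and a real system would need persistence and locking.

```python
from collections import Counter, deque
from dataclasses import dataclass
from typing import Optional

# Example limits from the text; both are per job type.
USER_LIMIT = 10
SYSTEM_LIMIT = 100


@dataclass(frozen=True)
class Job:
    job_id: int
    user: str
    job_type: str


class LimitedFifoQueue:
    """FIFO queue that only releases jobs whose limits would not be exceeded."""

    def __init__(self, user_limit: int = USER_LIMIT, system_limit: int = SYSTEM_LIMIT):
        self.user_limit = user_limit
        self.system_limit = system_limit
        self._pending = deque()
        self._running_by_user = Counter()  # (user, job_type) -> running count
        self._running_by_type = Counter()  # job_type -> running count

    def enqueue(self, job: Job) -> None:
        self._pending.append(job)

    def _admissible(self, job: Job) -> bool:
        # Running this job must keep both counts within their limits.
        return (
            self._running_by_user[(job.user, job.job_type)] < self.user_limit
            and self._running_by_type[job.job_type] < self.system_limit
        )

    def dequeue(self) -> Optional[Job]:
        # Scan in arrival order and release the oldest admissible job.
        for index, job in enumerate(self._pending):
            if self._admissible(job):
                del self._pending[index]
                self._running_by_user[(job.user, job.job_type)] += 1
                self._running_by_type[job.job_type] += 1
                return job
        return None  # nothing can run right now

    def complete(self, job: Job) -> None:
        # Call when a job finishes so its slots become available again.
        self._running_by_user[(job.user, job.job_type)] -= 1
        self._running_by_type[job.job_type] -= 1
```

Note that a blocked job does not hold up the jobs behind it: `dequeue` passes over it and returns the next admissible one, so ordering is FIFO only among jobs that are currently allowed to run.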
What We've Tried
We've implemented our own job queue system using a SQL database. There's a master job queue table in this database that contains information for every job. We've then created our own application that periodically (every 10 seconds or so) runs complex queries on this table to determine which jobs should be dequeued and run. This application then starts worker processes to run these jobs (not in containers, just standard processes).
There are several problems with this approach. First, the query for finding jobs that are ready to be run is very complicated and slow. Second, when there's a lot of activity on the system, the job queue table can become a massive bottleneck for the entire system. Third, since we want to start running these worker processes in their own Docker containers, we'd like a Kubernetes cluster to be the direct consumer of the queue if possible, rather than having our own application as an intermediary.
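For illustration only, a heavily simplified version of this kind of polling query might look like the sketch below. It is not our actual schema or query: the `jobs` table, its columns, and the `claim_next_job` helper are assumptions made for the example, and SQLite stands in for the real database.

```python
import sqlite3

# Example limits from the text; both are per job type.
USER_LIMIT = 10
SYSTEM_LIMIT = 100

# Oldest queued job whose user and system running counts are under the limits.
NEXT_JOB_SQL = """
SELECT q.id
FROM jobs AS q
WHERE q.status = 'queued'
  AND (SELECT COUNT(*) FROM jobs AS r
       WHERE r.status = 'running'
         AND r.job_type = q.job_type
         AND r.user_id = q.user_id) < :user_limit
  AND (SELECT COUNT(*) FROM jobs AS r
       WHERE r.status = 'running'
         AND r.job_type = q.job_type) < :system_limit
ORDER BY q.id
LIMIT 1
"""


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical single-table layout for the job queue.
    conn.execute(
        """
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            user_id TEXT NOT NULL,
            job_type TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'queued'
        )
        """
    )


def claim_next_job(conn, user_limit=USER_LIMIT, system_limit=SYSTEM_LIMIT):
    """Mark the next admissible job as running and return its id, or None."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            NEXT_JOB_SQL,
            {"user_limit": user_limit, "system_limit": system_limit},
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        return row[0]
```

Even in this reduced form, every poll runs two correlated counts per queued row, which hints at why the real query gets slow as the table grows. The sketch also assumes a single poller; several concurrent pollers would need row locking or a similar mechanism to avoid claiming the same job twice.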
The Question
What popular approaches are there to complex job queues? I can't imagine we're the only ones who need a job queue that imposes concurrency limits, and I also can't imagine that our SQL approach is the best way to achieve what we need. What could we do in this situation to make our job queue system as performant as possible?