I have a table that's used as a queue:
```sql
create table items
(
    itemId    int,       -- PK, identity
    status    int,       -- Possible values: 0 = Pending, 1 = Processing, 2 = Processed
    createdAt datetime2,
    updatedAt datetime2,
    content   text
)
```
Producers add records to the queue and consumers read them FIFO:
- Producers add records with `Pending` status and `createdAt` set to the current time.
- Consumers select records in `Pending` status ordered by `createdAt`. When consuming, they mark the record as `Processing` and set `updatedAt` to the current time, using an update/select CTE (see the sketch after this list).
- After processing, consumers mark records as `Processed`.
- While processing, consumers may crash and therefore won't be able to mark the record as `Processed`.
- When another consumer finds a record that's been stuck in the `Processing` state for longer than x minutes (i.e. `updatedAt < current_time - x` and `status = Processing`), it picks the record up and processes it (let's assume the new consumer won't crash ;) ).
- The table has about 1M records and grows by about 20k a day.
- At any given time there will be < 100 `Pending` and `Processing` records.
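The dequeue step looks roughly like this (a minimal sketch; the `TOP (1)` batch size and the `READPAST`/`UPDLOCK`/`ROWLOCK` hints are illustrative assumptions, not an exact copy of my code):

```sql
-- Atomically claim the oldest Pending record and return it to the consumer.
WITH nextItem AS
(
    SELECT TOP (1) itemId, status, updatedAt, content
    FROM items WITH (READPAST, UPDLOCK, ROWLOCK)  -- skip rows already locked by other consumers
    WHERE status = 0                              -- Pending
    ORDER BY createdAt                            -- FIFO
)
UPDATE nextItem
SET status = 1,                                   -- Processing
    updatedAt = SYSUTCDATETIME()
OUTPUT inserted.itemId, inserted.content;
```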
I have 2 questions:

- Given this scenario (especially the last 2 points), would an index on `(status, createdAt)` with `updatedAt` as an included column be a good index (see the DDL sketch below)?
  I tried it, and the plan shows the index is hit and execution time is very fast (sub-second). However, I'm not sure this low-cardinality index (with leading column `status`) is a good index, because such indexes are generally considered bad. I'm wondering whether it works in my case because of the highly uneven distribution of the possible values (< 1% of records are `Pending` or `Processing`, which is all I query on; no queries ever select `Processed` records).
- I added the included column (`updatedAt`) to support the filter `status = Processing and updatedAt < current_time - x` (see the query sketch below), but I'm not sure it's useful. Does the query planner care about the included column, or does it only look at the key columns of the index (`status, createdAt`)?
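The index from the first question would be created roughly like this (the index name is just a placeholder):

```sql
-- Nonclustered index keyed on (status, createdAt), with updatedAt as an included column.
CREATE NONCLUSTERED INDEX IX_items_status_createdAt  -- placeholder name
ON items (status, createdAt)
INCLUDE (updatedAt);
```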
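And the stale-record check from the second question, sketched with a placeholder `@timeoutMinutes` standing in for x:

```sql
-- Find records stuck in Processing longer than the timeout.
DECLARE @timeoutMinutes int = 15;   -- placeholder value for x

SELECT itemId, createdAt, updatedAt
FROM items
WHERE status = 1                    -- Processing
  AND updatedAt < DATEADD(MINUTE, -@timeoutMinutes, SYSUTCDATETIME());
```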
Bonus points if you answer both questions ;)