
I have a table that's used as a queue:

create table items
(
    itemId     int identity primary key,
    status     int,        -- 0 = Pending, 1 = Processing, 2 = Processed
    createdAt  datetime2,
    updatedAt  datetime2,
    content    text
)

Producers add records to the queue and consumers read them FIFO:

  • Producers add records with Pending status with createdAt as current time
  • Consumers select records in Pending status ordered by createdAt. When consuming, they mark the record as Processing and set updatedAt to the current time, atomically, using an update over a select CTE (see the sketch after this list)
  • After processing, consumers mark records as Processed
  • While processing, consumers may crash and therefore won't be able to mark the record as Processed
  • When another consumer finds a record that's stuck in Processing state for longer than x minutes (i.e. updatedAt < current_time - x and status = Processing) they pick them up and process (let's assume the new consumer won't crash ;) )
  • The table has about 1M records and grows about 20k a day
  • At any given time there will be < 100 Pending and Processing records
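The claim step, then, looks roughly like the following simplified sketch. The @timeoutMinutes parameter stands for the x above, and the updlock/readpast hints are my assumption about what is needed so that concurrent consumers skip rows another consumer has already locked:

declare @timeoutMinutes int = 10;  -- the "x minutes" timeout from above

;with nextItem as
(
    select top (1) itemId, status, updatedAt
    from items with (updlock, readpast)
    where status = 0  -- Pending
       or (status = 1 -- Processing, but stuck for longer than the timeout
           and updatedAt < dateadd(minute, -@timeoutMinutes, sysutcdatetime()))
    order by createdAt
)
update nextItem
set status = 1,                    -- claim the row as Processing
    updatedAt = sysutcdatetime()
output inserted.itemId;            -- hand the claimed id back to the consumer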

I have two questions:

  1. Given this scenario (especially the last 2 points), would an index on (status, createdAt) with updatedAt as an included column be a good index?

I tried that, and the execution plan shows the index is used and the query is very fast (sub-second). However, I'm not quite sure this low-cardinality index (with status as the leading column) is a good index, because such indexes are generally considered bad. I'm wondering if it works in my case because of the highly uneven distribution of the possible values: fewer than 1% of records are Pending or Processing, which is all I query on, and no queries select Processed records.
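In DDL terms, the index from question 1 looks like this (the name is mine):

create index ix_items_status_createdAt
    on items (status, createdAt)
    include (updatedAt);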

  2. I added the included column (updatedAt) to support the filter status = Processing and updatedAt < current_time - x, but I'm not sure it's useful. Does the query planner care about the included column, or does it only look at the key columns of the index (status, createdAt)?

Bonus points if you answer both questions ;)


1 Answer


One of the main purposes of an index is to reduce the number of rows being read from the table. A low cardinality index means that the indexed column only takes on a handful of distinct values. So, if a table has ten million rows and there are ten values, then each value would have -- on average -- a million rows.

An index is not helpful for fetching a million rows from a ten million row table, because every (or just about every) data page will have a matching row. The purpose is to reduce the number of data pages being read.

So, your use of an index is quite reasonable, because the rows you are looking for are few. You are using the index to find non-processed rows, and there are fewer than a hundred of those at any given time.

Your index is much larger than necessary, though, because it also carries entries for all the processed rows. You might consider a filtered index that covers only the statuses you actually search on. From what you describe, I think this would be:

create index idx_items_status_updated
    on items (status, updatedAt)
    where status in (0, 1);  -- 0 = Pending, 1 = Processing

Sometimes, in these situations, you want to use a clustered index on status. Basically, this allows the "processed" items to be grouped together. And, if the data pages they are on are not accessed, then those data pages do not need to be loaded.
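A sketch of what that could look like here (the index name is mine, and the comment spells out the assumption it rests on):

-- assumption: the primary key on itemId is nonclustered (or is dropped and
-- re-added as such) first, since a table can have only one clustered index;
-- createdAt keeps FIFO order within each status
create clustered index cix_items_status
    on items (status, createdAt);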

In your case, though, I am guessing that items are added sequentially, so only the more recent items are being processed. Older data pages will be filled with processed items and -- because they are never referenced -- may not even occupy space in memory.

Gordon Linoff
  • Thanks @Gordon, can you also give some hints on the second part, i.e. included columns: are they even necessary? – ubi Apr 19 '20 at 00:46
  • @ubi . . . Using included columns keeps the index smaller than adding those columns as keys, so that can be important for optimization. In practice, though, I usually just add the additional columns as keys. – Gordon Linoff Apr 19 '20 at 01:02
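To make the "add them as keys" alternative from that comment concrete (the name is mine):

create index ix_items_status_createdAt_updatedAt
    on items (status, createdAt, updatedAt);

Either form covers the query. In both cases the updatedAt < current_time - x condition is evaluated as a residual filter over the status = Processing range (there is no equality predicate on createdAt, so the engine cannot seek on updatedAt); the practical difference is that a key column is carried in the index's non-leaf levels, while an included column is stored only at the leaf.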