SELECT FOR UPDATE subquery not respecting LIMIT clause under load

Question

I have the following query that uses the psql SKIP LOCKED feature to poll messages from a queue table sorted by the insertion time. My subquery also limits the result set by 10 since that is the maximum number of results that can be processed by the consumer.

DELETE FROM outbox_queue
WHERE id IN (
    SELECT id FROM outbox_queue
    WHERE delay_until IS NULL OR now() > delay_until
    ORDER BY enqueued_on, id
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING *

Under normal load, everything works as expected. However when I apply load tests to my system, I will occasionally (<5%) get results the range from dozens of records to even several hundred records being returned. This is confusing to me since my understanding is that the subquery should have a limit placed on the number of records returned.

I've checked my logs and verified that the records being pulled are indeed unique and are not duplicates. Once load is reduced, the behavior returns to what I would expect.

My application has three workers that poll messages from this table, and each worker has a single thread that runs this query and processes the result.

Postgres version:

PostgreSQL 10.18 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit

What am I missing?

Edit: Also including query plans for my old query posted in problem, as well as query plan for Erwin's answer.

Original query plan:

Delete on outbox_queue  (cost=1.87..85.97 rows=10 width=38) (actual time=1.324..1.358 rows=10 loops=1)
  ->  Nested Loop  (cost=1.87..85.97 rows=10 width=38) (actual time=1.316..1.339 rows=10 loops=1)
        ->  HashAggregate  (cost=1.44..1.54 rows=10 width=40) (actual time=1.297..1.300 rows=10 loops=1)
"              Group Key: ""ANY_subquery"".id"
"              ->  Subquery Scan on ""ANY_subquery""  (cost=0.42..1.42 rows=10 width=40) (actual time=0.034..1.289 rows=10 loops=1)"
                    ->  Limit  (cost=0.42..1.32 rows=10 width=22) (actual time=0.030..1.280 rows=10 loops=1)
                          ->  LockRows  (cost=0.42..90221.41 rows=1009865 width=22) (actual time=0.030..1.277 rows=10 loops=1)
                                ->  Index Scan using outbox_queue__enqueued_on on outbox_queue outbox_queue_1  (cost=0.42..80122.76 rows=1009865 width=22) (actual time=0.023..1.256 rows=10 loops=1)
                                      Filter: ((delay_until IS NULL) OR (now() > delay_until))
        ->  Index Scan using outbox_queue_pkey on outbox_queue  (cost=0.42..8.44 rows=1 width=14) (actual time=0.003..0.003 rows=1 loops=10)
"              Index Cond: (id = ""ANY_subquery"".id)"
Planning time: 0.230 ms
Execution time: 1.407 ms

Query plan for Erwin's answer. As you can see, the LIMIT is now being evaluated outside the nested loop, whereas previously it was being evaluated inside.

Delete on outbox_queue  (cost=1.97..86.07 rows=10 width=38) (actual time=0.065..0.094 rows=10 loops=1)
  CTE sel
    ->  Limit  (cost=0.42..1.32 rows=10 width=22) (actual time=0.030..0.039 rows=10 loops=1)
          ->  LockRows  (cost=0.42..90221.41 rows=1009865 width=22) (actual time=0.029..0.037 rows=10 loops=1)
                ->  Index Scan using outbox_queue__enqueued_on on outbox_queue outbox_queue_1  (cost=0.42..80122.76 rows=1009865 width=22) (actual time=0.024..0.027 rows=10 loops=1)
                      Filter: ((delay_until >= now()) IS NOT TRUE)
  ->  Nested Loop  (cost=0.65..84.75 rows=10 width=38) (actual time=0.061..0.080 rows=10 loops=1)
        ->  HashAggregate  (cost=0.23..0.33 rows=10 width=40) (actual time=0.053..0.056 rows=10 loops=1)
              Group Key: sel.id
              ->  CTE Scan on sel  (cost=0.00..0.20 rows=10 width=40) (actual time=0.034..0.049 rows=10 loops=1)
        ->  Index Scan using outbox_queue_pkey on outbox_queue  (cost=0.42..8.44 rows=1 width=14) (actual time=0.002..0.002 rows=1 loops=10)
              Index Cond: (id = sel.id)
Planning time: 0.153 ms
Execution time: 0.174 ms

And here is my schema for the outbox_queue table.

create table if not exists outbox_queue (
    id              bigserial not null primary key,
    message_type    varchar not null,
    body            varchar not null,
    enqueued_on     timestamptz not null,
    delay_until     timestamptz,
    attributes      jsonb not null
);

create index if not exists outbox_queue__enqueued_on on outbox_queue(enqueued_on);
create index if not exists outbox_queue__delay_until on outbox_queue(delay_until);

I would be interested in the exact query plan produced with `EXPLAIN (ANALYZE)`. — Erwin Brandstetter, Oct 06 '22 at 00:34

Erwin Brandstetter · Accepted Answer · 2022-10-06T00:32:03.573

The query plan for this particular combination can result in a Nested Loop, which can exceed the LIMIT set in the subquery. See:

Postgres UPDATE ... LIMIT 1

Rows are locked after they have been returned from the subquery, and running this in a nested loop in combination with SKIP LOCKED can sidestep the LIMIT. I have not fully verified the exact mechanism to this day. Also, this may have been fixed in recent versions. Didn't rerun tests, yet.

Materializing the selection in a CTE reliably prevents the problem in any case:

WITH sel AS (
   SELECT id
   FROM   outbox_queue
   WHERE  delay_until >= now() IS NOT TRUE  -- simpler & may be faster
-- WHERE  delay_until IS NULL OR now() > delay_until
   ORDER  BY enqueued_on, id
   LIMIT  10
   FOR    UPDATE SKIP LOCKED
   )
DELETE FROM outbox_queue
WHERE  id IN (TABLE sel)
RETURNING *

While being at it I replaced:

delay_until IS NULL OR now() > delay_until

with:

delay_until >= now() IS NOT TRUE

Equivalent. Simpler. May be faster.

Postgres 12 added the MATERIALIZED option. Explicitly materialize to be sure:

WITH sel AS MATERIALIZED (  -- !
   SELECT id FROM outbox_queue
   WHERE  delay_until >= now() IS NOT TRUE
   ORDER  BY enqueued_on, id
   LIMIT  10
   FOR    UPDATE SKIP LOCKED
   )
DELETE FROM outbox_queue
WHERE  id IN (TABLE sel)
RETURNING *

The manual:

By default, a side-effect-free WITH query is folded into the primary query if it is used exactly once in the primary query's FROM clause. This allows joint optimization of the two query levels in situations where that should be semantically invisible. However, such folding can be prevented by marking the WITH query as MATERIALIZED. That might be useful, for example, if the WITH query is being used as an optimization fence to prevent the planner from choosing a bad plan. PostgreSQL versions before v12 never did such folding, so queries written for older versions might rely on WITH to act as an optimization fence.

Wow, this is very interesting, I had no idea that this was possible but I guess it makes sense. I appreciate you taking the time to give such a thorough answer, and the optimization tip on the `where` clause. This CTE query is working as expected - execution time ~10-15ms with 1MM records in the table. I am kicking off my load tests again now and will report back when I get the results. Thanks! — Dan, Oct 06 '22 at 01:34
I cannot see how that could happen unless `id` is not unique. But it seems like you have no good explanation either... — Laurenz Albe, Oct 06 '22 at 07:05
Corect, id is autoincrementing bigserial PK. In my logs I am logging all the records returned, and all the IDs are always unique. — Dan, Oct 06 '22 at 11:27
I've updated my question to include both old/new query plans as well as the table schema. — Dan, Oct 06 '22 at 11:33
Update - this does seem to have done the trick. Thanks again! — Dan, Oct 06 '22 at 20:28
I was debugging a random test failure for days due to a LIMIT query returning too many records. This answer helped me solve the issues. Thanks! — Glenn, Aug 04 '23 at 11:53

SELECT FOR UPDATE subquery not respecting LIMIT clause under load

1 Answers1