
I have a query where, with sequential scans enabled on the Postgres database and now() in the WHERE clause, the query planner prefers a sequential scan of the table followed by a filter:

    EXPLAIN ANALYZE
    SELECT
        action_id
    FROM
        events
    WHERE
        started_at IS NULL
        AND deleted_at IS NULL
        AND due_at < now()
        AND due_at > now() - interval '14 days'
    LIMIT 1
    FOR UPDATE
        SKIP LOCKED;

Example: https://explain.depesz.com/s/xLlM

Query with enable_seqscan db parameter set to false: https://explain.depesz.com/s/e8Fe
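
For anyone reproducing the comparison: the parameter can also be toggled per session rather than at the database level; this is only a sketch of that diagnostic step, using the same query as above:

    SET enable_seqscan = off;   -- discourage (not forbid) sequential scans for this session
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT
        action_id
    FROM
        events
    WHERE
        started_at IS NULL
        AND deleted_at IS NULL
        AND due_at < now()
        AND due_at > now() - interval '14 days'
    LIMIT 1
    FOR UPDATE
        SKIP LOCKED;
    RESET enable_seqscan;       -- restore the default afterwards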

I am looking to help the query optimiser use the index.

I suspect that because the rows matching the started_at IS NULL AND deleted_at IS NULL filter make up roughly 13% of the table (and the due_at column is completely unique and uncorrelated with the physical row order), the query optimiser is pessimistic about finding a match quickly enough via the index, but in practice that is not the case.
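
The statistics behind that suspicion can be checked directly in pg_stats; a small sketch, using the table and column names from the query above:

    SELECT attname, null_frac, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'events'
      AND attname IN ('started_at', 'deleted_at', 'due_at');

For due_at this reports n_distinct = -1 and null_frac = 0, i.e. every timestamp is distinct.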

EDIT: For the time being I have restructured the query like so:

    SELECT
        id,
        previous_event_id,
        due_at,
        action_id,
        subscription_url
    FROM (
        SELECT
            id,
            previous_event_id,
            due_at,
            action_id,
            subscription_url
        FROM
            events
        WHERE
            started_at IS NULL
            AND deleted_at IS NULL
        LIMIT 100
        FOR UPDATE
            SKIP LOCKED
    ) events_to_pick_from
    WHERE EXISTS (
        SELECT 1
        FROM
            events
        WHERE
            events_to_pick_from.due_at < now()
            AND events_to_pick_from.due_at > now() - interval '14 days'
            AND events.action_id = events_to_pick_from.action_id
    )
    LIMIT 1
    FOR UPDATE
        SKIP LOCKED;

https://explain.depesz.com/s/fz2h

But I would be grateful for other suggestions.

Willeman
  • What happens if you use `localtimestamp` instead of `now()`? –  Nov 16 '20 at 09:31
  • `... and find the row much quicker. ...` Cache warm-up? (the 8M table will probably fit in memory, anyway) – wildplasser Nov 16 '20 at 10:21
  • @a_horse_with_no_name no difference as expected. – Willeman Nov 16 '20 at 11:04
  • @wildplasser Yeah I think you're right. Doing a discard all between queries levels the two queries. – Willeman Nov 16 '20 at 11:04
  • Your row estimates are off (800K vs 1). Do you have valid statistics? – wildplasser Nov 16 '20 at 14:38
  • I can see from pg_stats that the due_at column has n_distinct = -1 and null_frac = 0, since the timestamps won't ever be the same. Correlation is -0.069. I imagine quantising the due_at timestamps to discrete intervals would help, but it feels like all-unique timestamps are a common enough situation to be fixable. – Willeman Nov 16 '20 at 15:58
  • @wildplasser Upped the statistics for the column without any improvement: `alter table events alter column due_at set statistics 10000;` and `vacuum analyze events;` – Willeman Nov 17 '20 at 16:38
  • Even in your new query, the row estimates are wrong. There must be something that you are not showing us (such as table definitions, cardinalities, tuning constants). Last comment: avoid NULLs if they don't mean anything. For dates, -infinity and +infinity are often excellent sentinel values. – wildplasser Nov 20 '20 at 23:31

1 Answer


Both queries have the same execution plan.

The difference is that the query with the constants happens to find a row that matches the condition quickly, after reading only 27 rows from the table.

The query using now() does not find a single matching row in the table (actual rows=0), but it has to scan all 7 million rows before it knows for sure.

An index on due_at should improve the performance considerably.
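
A minimal sketch of what that could look like (the plain form, plus a partial variant that also covers the two IS NULL conditions, as suggested in the comments below):

    -- plain B-tree index on the range column
    CREATE INDEX ON events (due_at);

    -- or a partial index that also covers the NULL conditions
    CREATE INDEX ON events (due_at)
        WHERE started_at IS NULL
          AND deleted_at IS NULL;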

Laurenz Albe
  • That is true, and perhaps the example is somewhat misleading; the query using now() will still go through all the rows even if it does find a row. I'll update the question with such an example instead. My question is really why it takes longer that way around: there is an index on due_at, and when sequential scans are disabled it is used well, but I'd like it to work well without such a custom setting. – Willeman Nov 16 '20 at 10:12
  • No, the scan will stop as soon as it finds a row that is not yet locked and meets all the conditions. – Laurenz Albe Nov 16 '20 at 10:18
  • OK, point taken. I have added a successful query using now(). Yet after many samples I can see it takes much longer to find a row using now(), and from the documentation I can't see why it should. – Willeman Nov 16 '20 at 10:32
  • No, it doesn't take any longer. Perhaps your example is not good, and you can show one that finds a result and still takes long. – Laurenz Albe Nov 16 '20 at 10:34
  • I see you are right, thanks. I think being able to choose a time slice that is not exactly now() influences the outcome a lot, but it is the value of the due_at field, not the now() function, that makes all the difference. When sequential scans are disabled the performance is much better because it uses the composite due_at index, but I don't understand why it wouldn't use the index when given the choice. – Willeman Nov 16 '20 at 10:54
  • @Willeman: the optimizer would probably be more likely to choose the index on `due_at` if it was a filtered index e.g. `create index on events(due_at) where started_at is null and deleted_at is null` –  Nov 16 '20 at 11:03
  • @a_horse_with_no_name I created the index, and I can see it in use with enable_seqscan=false, but there's no difference when sequential scans are enabled, as before. `CREATE INDEX events_due_at_partial_idx ON public.events USING btree (due_at) WHERE ((started_at IS NULL) AND (deleted_at IS NULL))` – Willeman Nov 17 '20 at 16:36
  • You still need to show a query that takes longer with `now()` but selects the same rows. Without that we cannot help. – Laurenz Albe Nov 17 '20 at 16:55
  • @LaurenzAlbe After some careful cache discarding between queries I have put my assumption that now() is slower to bed; my main quest was to get Postgres to use the due_at index with sequential scans enabled (as they should be). I appreciate there may not be enough info here to help, but I'm hopeful there are more things I can investigate. General guidance appreciated. – Willeman Nov 17 '20 at 17:05
  • https://stackoverflow.com/q/10643215/905902 [BTW: your one-table query should not be a difficult case. With the correct data model and tuning the engine should pick the correct plan without any tweaking] – wildplasser Nov 18 '20 at 11:23
  • @wildplasser Thanks, I can definitely look to improve things on the engine side; for the time being I decided to restructure the query with a nested loop semi-join. – Willeman Nov 19 '20 at 12:05
  • SELECT id, previous_event_id, due_at, action_id, subscription_url FROM ( SELECT id, previous_event_id, due_at, action_id, subscription_url from events WHERE started_at is null AND deleted_at is null LIMIT 100 FOR update SKIP LOCKED ) events_to_pick_from WHERE EXISTS ( SELECT 1 FROM events WHERE events_to_pick_from.due_at < now() AND events_to_pick_from.due_at > now() - interval '14 days' AND events.action_id = events_to_pick_from.action_id ) LIMIT 1 – Willeman Nov 19 '20 at 12:06
  • Either edit the question with the new content or ask a new one. Don't forget `EXPLAIN (ANALYZE, BUFFERS)` output. – Laurenz Albe Nov 19 '20 at 12:07