2

I have two queries that when run separately, each take 1-10 seconds to execute. When I combine their results with UNION ALL, the execution time jumps up to 4500 seconds (over 1 hour)!

I have run the long query with EXPLAIN ANALYZE and it appears to iterate over man_jour_tracking 1.6B times - even though that table is only ~11k lines long. It looks to be creating man_jour_tracking on each join -- but only when the UNION is present, not when run on it's own. Explain Analyze Output

I do not know if it has to do with the json_object data type. But somehow the planner is not handling this correctly since they execute on their own and all records will be unique.

(I have taken out a variety of select columns to shorten the query since they are not relevant). PostgreSQL 13.7

Running this query alone (first SELECT from union query) take ~5 seconds.

WITH most_recent_journum as (
    SELECT
        max(journal_number) as journum,
        max(journal_date) as jourdate,
        max(source_type)
    FROM journal
    GROUP BY source_id
),

man_jour_tracking as (
    SELECT
        json_object(Array_agg(tracking_category.name), array_agg(option)) as tracking,
        manual_journal_line_id,
        manual_journal_id
    FROM
        manual_journal_line_has_tracking_category line_tracking
        LEFT JOIN tracking_category on line_tracking.tracking_category_id = tracking_category.tracking_category_id
    GROUP BY manual_journal_line_id, manual_journal_id
)

SELECT
    journal.journal_id as journal_id,
    journal.journal_number as journal_number,
    gross_amount as gross_amount,
    line.account_code as account_code,
    line.account_name as account_name,
    line.account_type as account_type,
    line.description as description,
    manual_journal.status as status,
    trim('"' FROM (man_jour_tracking.tracking -> 'Tracking')::text) as tracking,
    trim('"' FROM (man_jour_tracking.tracking -> 'Events Tracking')::text) as events_tracking,
    NULL as invoice_number,
    NULL as contact_name,
    manual_journal.narration as source_description
FROM journal
    LEFT JOIN journal_line line on line.journal_id = journal.journal_id
    LEFT JOIN manual_journal ON journal.source_id = manual_journal.manual_journal_id
    LEFT JOIN manual_journal_line manual_line ON
        manual_line.manual_journal_id = manual_journal.manual_journal_id AND line.account_code = manual_line.account_code AND line.description = manual_line.description AND line.gross_amount = manual_line.line_amount
    LEFT JOIN 
        man_jour_tracking ON
        manual_journal.manual_journal_id = man_jour_tracking.manual_journal_id AND man_jour_tracking.manual_journal_line_id::int = manual_line.line
WHERE
    journal.source_type = 'MANJOURNAL' AND
    journal.journal_number in (select journum from most_recent_journum) AND
    journal.journal_date AT TIME ZONE 'UTC' >= '2022-10-01'::date AND journal_date AT TIME ZONE 'UTC' < '2023-01-01'::date AND
    line.journal_line_id is not null AND
    line.account_code <> '2650' AND
    manual_journal.status = 'POSTED'

Running this query alone (second SELECT from union query) takes ~5 seconds

WITH most_recent_journum as (
    SELECT
        max(journal_number) as journum,
        max(journal_date) as jourdate,
        max(source_type)
    FROM journal
    GROUP BY source_id
),

man_jour_tracking as (
    SELECT
        json_object(Array_agg(tracking_category.name), array_agg(option)) as tracking,
        manual_journal_line_id,
        manual_journal_id
    FROM
        manual_journal_line_has_tracking_category line_tracking
        LEFT JOIN tracking_category on line_tracking.tracking_category_id = tracking_category.tracking_category_id
    GROUP BY manual_journal_line_id, manual_journal_id
)

SELECT
    journal.journal_id as journal_id,
    journal.journal_number as journal_number,
    NULL::int as source_line,
    gross_amount as gross_amount,
    line.account_code as account_code,
    line.account_name as account_name,
    line.account_type as account_type,
    line.description as description,
    invoice.status as status,
    NULL as tracking,
    NULL as events_tracking,
    invoice.invoice_number,
    contact.name as contact_name,
    NULL as source_description
FROM journal
    LEFT JOIN journal_line line on line.journal_id = journal.journal_id 
    LEFT JOIN invoice ON journal.source_id = invoice.invoice_id
    LEFT JOIN contact on invoice.contact_id = contact.contact_id
WHERE
    journal.source_type in ('ACCPAY', 'ACCREC') AND
    journal.journal_number in (select journum from most_recent_journum) AND
    journal.journal_date AT TIME ZONE 'UTC' >= '2022-10-01'::date AND journal_date AT TIME ZONE 'UTC' < '2023-01-01'::date AND
    line.journal_line_id is not null AND
    invoice.status not in ('VOIDED', 'DELETED')

But when you combine them together, the query takes over 1 hour.

WITH most_recent_journum as (
    SELECT
        max(journal_number) as journum,
        max(journal_date) as jourdate,
        max(source_type)
    FROM journal
    GROUP BY source_id
),

man_jour_tracking as (
    SELECT
        json_object(Array_agg(tracking_category.name), array_agg(option)) as tracking,
        manual_journal_line_id,
        manual_journal_id
    FROM
        manual_journal_line_has_tracking_category line_tracking
        LEFT JOIN tracking_category on line_tracking.tracking_category_id = tracking_category.tracking_category_id
    GROUP BY manual_journal_line_id, manual_journal_id
)

SELECT
    journal.journal_id as journal_id,
    journal.journal_number as journal_number,
    gross_amount as gross_amount,
    line.account_code as account_code,
    line.account_name as account_name,
    line.account_type as account_type,
    line.description as description,
    manual_journal.status as status,
    trim('"' FROM (man_jour_tracking.tracking -> 'Tracking')::text) as tracking,
    trim('"' FROM (man_jour_tracking.tracking -> 'Events Tracking')::text) as events_tracking,
    NULL as invoice_number,
    NULL as contact_name,
    manual_journal.narration as source_description
FROM journal
    LEFT JOIN journal_line line on line.journal_id = journal.journal_id
    LEFT JOIN manual_journal ON journal.source_id = manual_journal.manual_journal_id
    LEFT JOIN manual_journal_line manual_line ON
        manual_line.manual_journal_id = manual_journal.manual_journal_id AND line.account_code = manual_line.account_code AND line.description = manual_line.description AND line.gross_amount = manual_line.line_amount
    LEFT JOIN 
        man_jour_tracking ON
        manual_journal.manual_journal_id = man_jour_tracking.manual_journal_id AND man_jour_tracking.manual_journal_line_id::int = manual_line.line
WHERE
    journal.source_type = 'MANJOURNAL' AND
    journal.journal_number in (select journum from most_recent_journum) AND
    journal.journal_date AT TIME ZONE 'UTC' >= '2022-10-01'::date AND journal_date AT TIME ZONE 'UTC' < '2023-01-01'::date AND
    line.journal_line_id is not null AND
    line.account_code <> '2650' AND
    manual_journal.status = 'POSTED'

UNION ALL

SELECT
    journal.journal_id as journal_id,
    journal.journal_number as journal_number,
    NULL::int as source_line,
    gross_amount as gross_amount,
    line.account_code as account_code,
    line.account_name as account_name,
    line.account_type as account_type,
    line.description as description,
    invoice.status as status,
    NULL as tracking,
    NULL as events_tracking,
    invoice.invoice_number,
    contact.name as contact_name,
    NULL as source_description
FROM journal
    LEFT JOIN journal_line line on line.journal_id = journal.journal_id 
    LEFT JOIN invoice ON journal.source_id = invoice.invoice_id
    LEFT JOIN contact on invoice.contact_id = contact.contact_id
WHERE
    journal.source_type in ('ACCPAY', 'ACCREC') AND
    journal.journal_number in (select journum from most_recent_journum) AND
    journal.journal_date AT TIME ZONE 'UTC' >= '2022-10-01'::date AND journal_date AT TIME ZONE 'UTC' < '2023-01-01'::date AND
    line.journal_line_id is not null AND
    invoice.status not in ('VOIDED', 'DELETED')

Any help is much appreciated!

Things I have tried:

  1. Separated man_jour_tracking out from a CTE into a temp table (CREATE TEMP TABLE man_jour_tracking) and created an index (CREATE INDEX idx_man_jour_tracking ON man_jour_tracking (manual_journal_id, manual_journal_line_id)) in hopes of speeding up the join or forcing the planner to cache the joined table. No faster.
  2. Switched order of SELECT statements within union
  3. Replacing UNION ALL with UNION - no help.
  4. Even remove man_jour_tracking CTE entirely and still takes ages longer when combined than when separate.

EDIT - SOLVED Thanks to @jjanes, creating the CTE's with as NOT MATERIALIZED fixed it! I do think it's a bit counter-intuitive because you would want the CTE cached, however perhaps it removes a critical index that substantially slows down the joins. Another post here for more context.

Scott C.
  • 21
  • 2
  • Execution plans are better shared as formatted text, generated using `explain (analyze, buffers) ...` the graphical display simply hides too many important details. If it's to big to be included in the question, you can share it e.g. through https://explain.depesz.com/ –  Mar 27 '23 at 06:01
  • Here is the full explain text: https://explain.depesz.com/s/Zh2s – Scott C. Mar 27 '23 at 13:53
  • Your plan references a CTE "constants", but no such CTE exists in your question. Since you showed us some amended query, show the plan for the amended query. – jjanes Mar 27 '23 at 14:17
  • 1
    What happens if you do `WITH most_recent_journum as NOT MATERIALIZED (...`? – jjanes Mar 27 '23 at 14:20
  • @jjanes `NOT MATERIALIZED` did it!! It brought the runtime down to under 20 seconds. "constants" was a very brief CTE where it just kept constants to be referenced throughout the query (`SELECT '2022-10-01'::date as start_date, '2023-01-01'::date as end_date`) and non-material in terms of runtime. – Scott C. Mar 27 '23 at 15:57
  • @jjanes any idea why NOT MATERIALIZED fixes it? Wouldn't you want the CTE cached so that it does not need to be created with each iteration? I would have thought the problem was that it wasn't materialized in the first place, hence the 1.6B rows recreated with each iteration. – Scott C. Mar 27 '23 at 16:03
  • That is odd to me, but without a plan which matches the query or a query which matches the plan, there is no way to investigate it. – jjanes Mar 27 '23 at 20:43

0 Answers0