I'm using PostgreSQL 9.4.20.

I am having a hard time optimizing a query. After looking at the EXPLAIN output, I think the issue is in the subquery behind the second LEFT JOIN. Note that I use a subquery there because I have multiple sources that can be considered responses, so I build a homogeneous data source by (when necessary) combining them with UNION ALLs. For the simplicity of the question I left the UNION ALLs out, since they are not the main problem I am facing. Here are the facts:

Query

SELECT labels."labelID",
    COUNT(DISTINCT "demographics"."user_id") as "memberCount",
    AVG("t"."value") FILTER(WHERE "t"."uuid" = '6d5f647f-6251-4ded-bb9f-9523debf2bd5' AND "t"."type" = 'Rating') AS "6d5f647f-6251-4ded-bb9f-9523debf2bd5__RA",
    AVG("t"."value") FILTER(WHERE "t"."uuid" = '40ff1921-9d62-4183-9de8-36d845d47494' AND "t"."type" = 'Rating') AS "40ff1921-9d62-4183-9de8-36d845d47494__RA"
FROM (
    SELECT DISTINCT ("user_info"->>'gender') as "labelID",
        ("user_info"->>'gender') as "labelName"
    FROM "user_demographics"
    WHERE "iteration_id" = 2 AND "company_id" = 1
    LIMIT 8
) as "labels"
LEFT JOIN "user_demographics" as "demographics" on "labels"."labelID" IS NOT DISTINCT FROM ("demographics"."user_info"->>'gender')::text and "demographics"."iteration_id" = 2
LEFT JOIN (
  SELECT questions.uuid::uuid,
    "responses"."id" as "response_id",
    'Rating'::text as "type",
    (groups.type)::text as "group_type",
    "requests"."user_id" as user_id,
    ("responses"."content"->>'rating')::numeric as "value"
  FROM responses
  INNER JOIN request_questions ON request_questions.id = responses.request_question_id AND request_questions.question_id IN (SELECT id FROM questions WHERE uuid IN ('6d5f647f-6251-4ded-bb9f-9523debf2bd5', '40ff1921-9d62-4183-9de8-36d845d47494'))
  INNER JOIN questions ON questions.id = request_questions.question_id
  INNER JOIN requests ON requests.id = request_questions.request_id
  INNER JOIN groups ON groups.id = requests.group_id
  WHERE requests.submitted_at IS NOT NULL
) as t on t.user_id = demographics.user_id
group by "labels"."labelID", "labels"."labelName";

Explain Analyze

GroupAggregate  (cost=12894.34..12903.78 rows=8 width=306) (actual time=604.444..604.444 rows=1 loops=1)
  Group Key: ((user_demographics.user_info ->> 'gender'::text)), ((user_demographics.user_info ->> 'gender'::text))
  ->  Sort  (cost=12894.34..12894.93 rows=233 width=306) (actual time=555.363..566.431 rows=46019 loops=1)
        Sort Key: ((user_demographics.user_info ->> 'gender'::text)), ((user_demographics.user_info ->> 'gender'::text))
        Sort Method: external sort  Disk: 4488kB
        ->  Hash Left Join  (cost=2091.03..12885.18 rows=233 width=306) (actual time=497.782..531.016 rows=46019 loops=1)
              Hash Cond: (demographics.user_id = requests.user_id)
              ->  Nested Loop Left Join  (cost=1475.26..12266.78 rows=233 width=68) (actual time=6.326..16.015 rows=5898 loops=1)
                    Join Filter: (NOT (((user_demographics.user_info ->> 'gender'::text)) IS DISTINCT FROM (demographics.user_info ->> 'gender'::text)))
                    ->  Limit  (cost=1341.68..1341.80 rows=8 width=323) (actual time=5.918..5.920 rows=1 loops=1)
                          ->  HashAggregate  (cost=1341.68..1342.54 rows=57 width=323) (actual time=5.918..5.920 rows=1 loops=1)
                                Group Key: (user_demographics.user_info ->> 'gender'::text), (user_demographics.user_info ->> 'gender'::text)
                                ->  Bitmap Heap Scan on user_demographics  (cost=373.83..1340.31 rows=274 width=323) (actual time=1.160..4.462 rows=5898 loops=1)
                                      Recheck Cond: ((iteration_id = 2) AND (company_id = 1))
                                      Heap Blocks: exact=301
                                      ->  BitmapAnd  (cost=373.83..373.83 rows=274 width=0) (actual time=1.121..1.121 rows=0 loops=1)
                                            ->  Bitmap Index Scan on user_demographics_iteration_id_fkey  (cost=0.00..132.11 rows=5826 width=0) (actual time=0.403..0.403 rows=5898 loops=1)
                                                  Index Cond: (iteration_id = 2)
                                            ->  Bitmap Index Scan on user_demographics_company_id_fkey  (cost=0.00..241.33 rows=10788 width=0) (actual time=0.695..0.695 rows=10792 loops=1)
                                                  Index Cond: (company_id = 1)
                    ->  Materialize  (cost=133.57..10123.82 rows=5826 width=327) (actual time=0.402..7.050 rows=5898 loops=1)
                          ->  Bitmap Heap Scan on user_demographics demographics  (cost=133.57..10094.69 rows=5826 width=327) (actual time=0.396..2.768 rows=5898 loops=1)
                                Recheck Cond: (iteration_id = 2)
                                Heap Blocks: exact=301
                                ->  Bitmap Index Scan on user_demographics_iteration_id_fkey  (cost=0.00..132.11 rows=5826 width=0) (actual time=0.366..0.366 rows=5898 loops=1)
                                      Index Cond: (iteration_id = 2)
              ->  Hash  (cost=612.69..612.69 rows=247 width=242) (actual time=491.430..491.430 rows=45722 loops=1)
                    Buckets: 1024  Batches: 2 (originally 1)  Memory Usage: 4097kB
                    ->  Nested Loop  (cost=2.42..612.69 rows=247 width=242) (actual time=0.055..466.795 rows=45722 loops=1)
                          ->  Nested Loop  (cost=2.13..531.53 rows=247 width=214) (actual time=0.047..364.824 rows=45722 loops=1)
                                ->  Nested Loop  (cost=1.70..335.50 rows=340 width=28) (actual time=0.038..198.281 rows=45722 loops=1)
                                      ->  Nested Loop  (cost=1.27..57.52 rows=597 width=24) (actual time=0.032..55.728 rows=51210 loops=1)
                                            Join Filter: (questions_1.id = request_questions.question_id)
                                            ->  Nested Loop  (cost=0.84..33.77 rows=2 width=24) (actual time=0.015..0.036 rows=2 loops=1)
                                                  ->  Index Scan using questions_uuid_key on questions questions_1  (cost=0.42..16.87 rows=2 width=4) (actual time=0.008..0.017 rows=2 loops=1)
                                                        Index Cond: (uuid = ANY ('{6d5f647f-6251-4ded-bb9f-9523debf2bd5,40ff1921-9d62-4183-9de8-36d845d47494}'::uuid[]))
                                                  ->  Index Scan using questions_pkey on questions  (cost=0.42..8.44 rows=1 width=20) (actual time=0.005..0.007 rows=1 loops=2)
                                                        Index Cond: (id = questions_1.id)
                                            ->  Index Scan using request_questions_question_id_fkey on request_questions  (cost=0.43..8.15 rows=298 width=12) (actual time=0.011..15.686 rows=25605 loops=2)
                                                  Index Cond: (question_id = questions.id)
                                      ->  Index Scan using requests_pkey on requests  (cost=0.42..0.46 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=51210)
                                            Index Cond: (id = request_questions.request_id)
                                            Filter: (submitted_at IS NOT NULL)
                                            Rows Removed by Filter: 0
                                ->  Index Scan using responses_request_question_id_fkey on responses  (cost=0.43..0.57 rows=1 width=194) (actual time=0.002..0.003 rows=1 loops=45722)
                                      Index Cond: (request_question_id = request_questions.id)
                          ->  Index Only Scan using groups_pkey on groups  (cost=0.29..0.32 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=45722)
                                Index Cond: (id = requests.group_id)
                                Heap Fetches: 45722
Planning time: 6.715 ms
Execution time: 606.478 ms

After analyzing the EXPLAIN, I think the main problem is in this subquery:

SELECT questions.uuid::uuid,
    "responses"."id" as "response_id",
    'Rating'::text as "type",
    (groups.type)::text as "group_type",
    "requests"."user_id" as user_id,
    ("responses"."content"->>'rating')::numeric as "value"
  FROM responses
  INNER JOIN request_questions ON request_questions.id = responses.request_question_id AND request_questions.question_id IN (SELECT id FROM questions WHERE uuid IN ('6d5f647f-6251-4ded-bb9f-9523debf2bd5', '40ff1921-9d62-4183-9de8-36d845d47494'))
  INNER JOIN questions ON questions.id = request_questions.question_id
  INNER JOIN requests ON requests.id = request_questions.request_id
  INNER JOIN groups ON groups.id = requests.group_id
  WHERE requests.submitted_at IS NOT NULL

And I am trying to find a way to optimize that one. When there are a lot of responses for a specific set of questions, it can take a while. For the sake of the exercise, it's important to know that:

  • responses has an index on request_question_id
  • request_questions has an index on question_id and request_id
  • requests has an index on group_id

I also noticed that if I add an extra condition to that subquery, such as:

AND groups.iteration_id = 2

performance gets worse, which is counter-intuitive to me because groups has an index on iteration_id.
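One thing that may be worth trying (a sketch, under the assumption that the subquery's result set is small enough to materialize): in 9.4 a CTE is an optimization fence, so lifting the subquery into a WITH clause forces PostgreSQL to compute it once and join the materialized result, instead of folding it into the outer plan based on the misestimated row counts:

```sql
-- 9.4 CTEs are optimization fences: computed once, materialized
WITH t AS (
  SELECT questions.uuid::uuid,
         'Rating'::text AS "type",
         requests.user_id,
         (responses.content->>'rating')::numeric AS "value"
  FROM responses
  JOIN request_questions ON request_questions.id = responses.request_question_id
  JOIN questions ON questions.id = request_questions.question_id
       AND questions.uuid IN ('6d5f647f-6251-4ded-bb9f-9523debf2bd5',
                              '40ff1921-9d62-4183-9de8-36d845d47494')
  JOIN requests ON requests.id = request_questions.request_id
  WHERE requests.submitted_at IS NOT NULL
)
SELECT ...               -- outer query unchanged, except that it now
FROM (...) AS "labels"   -- joins the materialized CTE instead of the
LEFT JOIN "user_demographics" AS "demographics" ON ...  -- inline subquery
LEFT JOIN t ON t.user_id = demographics.user_id
GROUP BY "labels"."labelID", "labels"."labelName";
```

Whether this helps depends on whether the misestimate is hurting the join order; the fence prevents the planner from re-planning the subquery per outer row.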

If more information is needed, please let me know.

UPDATE: Passing question ids as parameters, as suggested by Laurenz Albe

Query

SELECT labels."labelID",
    COUNT(DISTINCT "demographics"."user_id") as "memberCount",
    AVG("t"."value") FILTER(WHERE "t"."uuid" = '6d5f647f-6251-4ded-bb9f-9523debf2bd5' AND "t"."type" = 'Rating') AS "6d5f647f-6251-4ded-bb9f-9523debf2bd5__RA",
    AVG("t"."value") FILTER(WHERE "t"."uuid" = '40ff1921-9d62-4183-9de8-36d845d47494' AND "t"."type" = 'Rating') AS "40ff1921-9d62-4183-9de8-36d845d47494__RA"
FROM (
    SELECT DISTINCT ("user_info"->>'gender') as "labelID",
        ("user_info"->>'gender') as "labelName"
    FROM "user_demographics"
    WHERE "iteration_id" = 2 AND "company_id" = 1
    LIMIT 8
) as "labels"
LEFT JOIN "user_demographics" as "demographics" on "labels"."labelID" IS NOT DISTINCT FROM ("demographics"."user_info"->>'gender')::text and "demographics"."iteration_id" = 2
LEFT JOIN (
  SELECT questions.uuid::uuid,
    "responses"."id" as "response_id",
    'Rating'::text as "type",
    (groups.type)::text as "group_type",
    "requests"."user_id" as user_id,
    ("responses"."content"->>'rating')::numeric as "value"
  FROM responses
  INNER JOIN request_questions ON request_questions.id = responses.request_question_id AND request_questions.question_id IN (88,99)
  INNER JOIN questions ON questions.id = request_questions.question_id
  INNER JOIN requests ON requests.id = request_questions.request_id
  INNER JOIN groups ON groups.id = requests.group_id
  WHERE requests.submitted_at IS NOT NULL
) as t on t.user_id = demographics.user_id
group by "labels"."labelID", "labels"."labelName"

Explain Analyze

GroupAggregate  (cost=16424.72..16434.20 rows=8 width=306) (actual time=837.872..837.873 rows=1 loops=1)
  Group Key: ((user_demographics.user_info ->> 'gender'::text)), ((user_demographics.user_info ->> 'gender'::text))
  ->  Sort  (cost=16424.72..16425.31 rows=234 width=306) (actual time=787.578..798.965 rows=46019 loops=1)
        Sort Key: ((user_demographics.user_info ->> 'gender'::text)), ((user_demographics.user_info ->> 'gender'::text))
        Sort Method: external sort  Disk: 4488kB
        ->  Nested Loop Left Join  (cost=1481.27..16415.52 rows=234 width=306) (actual time=6.380..757.756 rows=46019 loops=1)
              ->  Nested Loop Left Join  (cost=1479.27..12302.49 rows=234 width=68) (actual time=6.333..18.580 rows=5898 loops=1)
                    Join Filter: (NOT (((user_demographics.user_info ->> 'gender'::text)) IS DISTINCT FROM (demographics.user_info ->> 'gender'::text)))
                    ->  Limit  (cost=1345.56..1345.68 rows=8 width=323) (actual time=5.920..5.922 rows=1 loops=1)
                          ->  HashAggregate  (cost=1345.56..1346.42 rows=57 width=323) (actual time=5.919..5.921 rows=1 loops=1)
                                Group Key: (user_demographics.user_info ->> 'gender'::text), (user_demographics.user_info ->> 'gender'::text)
                                ->  Bitmap Heap Scan on user_demographics  (cost=374.19..1344.19 rows=275 width=323) (actual time=1.133..4.425 rows=5898 loops=1)
                                      Recheck Cond: ((iteration_id = 2) AND (company_id = 1))
                                      Heap Blocks: exact=301
                                      ->  BitmapAnd  (cost=374.19..374.19 rows=275 width=0) (actual time=1.095..1.095 rows=0 loops=1)
                                            ->  Bitmap Index Scan on user_demographics_iteration_id_fkey  (cost=0.00..132.24 rows=5843 width=0) (actual time=0.389..0.389 rows=5898 loops=1)
                                                  Index Cond: (iteration_id = 2)
                                            ->  Bitmap Index Scan on user_demographics_company_id_fkey  (cost=0.00..241.56 rows=10819 width=0) (actual time=0.685..0.685 rows=10792 loops=1)
                                                  Index Cond: (company_id = 1)
                    ->  Materialize  (cost=133.70..10153.31 rows=5843 width=327) (actual time=0.405..8.636 rows=5898 loops=1)
                          ->  Bitmap Heap Scan on user_demographics demographics  (cost=133.70..10124.10 rows=5843 width=327) (actual time=0.401..3.085 rows=5898 loops=1)
                                Recheck Cond: (iteration_id = 2)
                                Heap Blocks: exact=301
                                ->  Bitmap Index Scan on user_demographics_iteration_id_fkey  (cost=0.00..132.24 rows=5843 width=0) (actual time=0.370..0.370 rows=5898 loops=1)
                                      Index Cond: (iteration_id = 2)
              ->  Nested Loop  (cost=2.00..17.57 rows=1 width=242) (actual time=0.020..0.122 rows=8 loops=5898)
                    ->  Nested Loop  (cost=1.71..17.24 rows=1 width=214) (actual time=0.018..0.103 rows=8 loops=5898)
                          ->  Nested Loop  (cost=1.29..16.06 rows=1 width=202) (actual time=0.016..0.084 rows=8 loops=5898)
                                ->  Nested Loop  (cost=0.86..11.07 rows=1 width=16) (actual time=0.012..0.055 rows=8 loops=5898)
                                      ->  Index Scan using entity_requests_user_id_fkey on requests  (cost=0.42..6.38 rows=3 width=12) (actual time=0.003..0.007 rows=7 loops=5898)
                                            Index Cond: (user_id = demographics.user_id)
                                            Filter: (submitted_at IS NOT NULL)
                                            Rows Removed by Filter: 1
                                      ->  Index Scan using request_questions_request_id_fkey on request_questions  (cost=0.43..1.55 rows=1 width=12) (actual time=0.004..0.006 rows=1 loops=39600)
                                            Index Cond: (request_id = requests.id)
                                            Filter: (question_id = ANY ('{88,99}'::integer[]))
                                            Rows Removed by Filter: 16
                                ->  Index Scan using responses_request_question_id_fkey on responses  (cost=0.43..4.99 rows=1 width=194) (actual time=0.003..0.003 rows=1 loops=45722)
                                      Index Cond: (request_question_id = request_questions.id)
                          ->  Index Scan using questions_pkey on questions  (cost=0.42..1.16 rows=1 width=20) (actual time=0.001..0.002 rows=1 loops=45722)
                                Index Cond: (id = request_questions.question_id)
                    ->  Index Only Scan using groups_pkey on groups  (cost=0.29..0.32 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=45722)
                          Index Cond: (id = requests.group_id)
                          Heap Fetches: 45722
Planning time: 2.308 ms
Execution time: 839.766 ms
Nobita

1 Answer


Your problem is the misestimate in the index scan on request_questions (298 rows estimated instead of the actual 25605).

This leads PostgreSQL to choose a nested loop join by mistake.

You might get a better estimate if you split the query into two queries:

  • one that fetches the matching questions.id values

  • one that takes those ids as parameters and does the rest.
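Concretely, the split could look like this (a sketch; the literal ids in the second query are whatever the first query returns):

```sql
-- Query 1: resolve the uuids to ids once
SELECT id
FROM questions
WHERE uuid IN ('6d5f647f-6251-4ded-bb9f-9523debf2bd5',
               '40ff1921-9d62-4183-9de8-36d845d47494');

-- Query 2: feed those ids in as literals/parameters, so the planner
-- no longer has to estimate across the IN (SELECT ...) subquery
...
INNER JOIN request_questions
    ON request_questions.id = responses.request_question_id
   AND request_questions.question_id IN (88, 99)  -- ids from query 1
...
```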

In the query you modified according to the above, there is another misestimate in the following part:

(
    SELECT DISTINCT ("user_info"->>'gender') as "labelID",
        ("user_info"->>'gender') as "labelName"
    FROM "user_demographics"
    WHERE "iteration_id" = 2 AND "company_id" = 1
    LIMIT 8
) as "labels"
LEFT JOIN "user_demographics" as "demographics"
    on "labels"."labelID" IS NOT DISTINCT FROM
            ("demographics"."user_info"->>'gender')::text
        and "demographics"."iteration_id" = 2

It looks like there should be a simpler way to write this, but I don't know the data, so I cannot say how. A simpler condition might lead to a better estimate.
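For example (an assumption-laden sketch: it only works if the empty string can never occur as a real gender value), IS NOT DISTINCT FROM could be replaced with plain equality on a sentinel value, which the planner tends to estimate better:

```sql
LEFT JOIN "user_demographics" AS "demographics"
    -- '' stands in for NULL on both sides; this assumes '' is never
    -- a legitimate gender value in the data
    ON COALESCE("labels"."labelID", '')
       = COALESCE(("demographics"."user_info"->>'gender')::text, '')
   AND "demographics"."iteration_id" = 2
```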

If you cannot figure it out, try adding an OR condition to the join condition that doesn't change the result. This will cause the optimizer to estimate a higher row count, so the overall plan might be better (no nested loop join).
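A sketch of that trick (assuming user_id is declared NOT NULL, so the extra branch can never be true and the result is unchanged):

```sql
LEFT JOIN "user_demographics" AS "demographics"
    ON ("labels"."labelID" IS NOT DISTINCT FROM
            ("demographics"."user_info"->>'gender')::text
        -- never true if user_id is NOT NULL; only inflates the estimate
        OR "demographics"."user_id" IS NULL)
   AND "demographics"."iteration_id" = 2
```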

Laurenz Albe
  • Laurenz, would you mind clarifying how the query would end up looking? I made the assumption that I had the question ids and modified the existing query so that it accepts those ids: ```INNER JOIN request_questions ON request_questions.id = responses.request_question_id AND request_questions.question_id IN (80,99)``` But that doesn't seem to have improved the overall performance; it has actually decreased. I'm happy to provide another EXPLAIN, but do you mind confirming this is what you were suggesting first? – Nobita Mar 12 '19 at 15:04
  • I have updated the question with the version that passes question ids as parameters and its corresponding EXPLAIN. – Nobita Mar 12 '19 at 15:21
  • I have no solution, but a suggestion. See my updated answer. – Laurenz Albe Mar 12 '19 at 22:46
  • Thank you, I will see what I can do. Basically I have a source for what I call 'labels' and I want to left join it with my `user_demographics` so that all labels appear in the results, matching the users whose specific attribute in `user_info` is not distinct from the value coming from labels. The reason I write `IS NOT DISTINCT FROM` is that sometimes the labels can be `NULL`, and with `=` a NULL label would not match a user demographics row whose user info attribute is NULL too. This is a simple explanation of the underlying data and the needs of this query. – Nobita Mar 13 '19 at 09:07
  • Laurenz, when you say "a simpler condition", do you mean that IS NOT DISTINCT FROM and AND are too complex? – Nobita Mar 13 '19 at 09:14
  • The JSON attribute access is a problem. PostgreSQL cannot estimate that well. Sorry that I cannot tell you exactly how to rewrite it in a simpler way, maybe with your knowledge of the data and the requirements you can come up with something simpler. Is there no way to avoid joining `user_demographics` twice? – Laurenz Albe Mar 13 '19 at 09:21
  • I'm joining with user_demographics once but I am using it in the subquery of the FROM to get all the potential values. That subquery is dynamic, sometimes I use user_demographics and some other times I use another source to gather the labels. I'm going to see if I can find a way to change the query and see if that helps PostgreSQL to plan a better query. Thanks Laurenz! – Nobita Mar 13 '19 at 10:25