
To simplify the case, let's assume there are the following 3 tables:

A(a_id), B(b_id,val_b), C(a_id,b_id,val_c)

I need to find all a_id that have certain value pairs from B and C. For example, find all a_id which have records where (val_b='1' and val_c='2' and B.b_id=C.b_id) AND (val_b='3' and val_c='4' and B.b_id=C.b_id) AND ...

select A.a_id
from A
where (A.a_id in 
        (select C.a_id 
         from B, C 
         where B.b_id=C.b_id and B.val_b='1' and C.val_c='2') and
       A.a_id in 
        (select C.a_id 
         from B, C 
         where B.b_id=C.b_id and B.val_b='3' and C.val_c='4') and
       A.a_id in 
        (select C.a_id 
         from B, C 
         where B.b_id=C.b_id and B.val_b='5' and C.val_c='6'));

What I noticed is that adding a few more (val_b, val_c) pairs makes Postgres take substantially longer to perform the query. Note that indexes are present for the ids, val_b and val_c.

Is there a way to optimize the query? I tried explicit inner joins, but that didn't improve the performance.

Thanks in advance

More info:

  • postgres version 8.2.4
  • With only one pair of criteria it runs in 77.621 ms
  • With 2 pairs of criteria - 151.588 ms
  • With 3 pairs of criteria - 49483.979 ms <-- performance spikes

  • Note that each separate subquery by itself runs in ~62 ms.

Update:

Both the INTERSECT query version suggested below by Vladimir Baranov and the version with a HAVING clause that uses the bool_or aggregate function by Clodoaldo Neto performed much better. Thank you!

However, the question remains: why does Postgres 8.2 show such a performance spike with the original query, starting at 3 pairs of criteria?

BTW, I noticed the same spike with Vladimir Baranov's first suggestion to rewrite the query with plain joins. See below:

SELECT A.a_id
FROM
    A
    INNER JOIN (SELECT C.a_id FROM B INNER JOIN C ON B.b_id=C.b_id WHERE B.val_b='1' and C.val_c='2') Set1 ON Set1.a_id = A.a_id
    INNER JOIN (SELECT C.a_id FROM B INNER JOIN C ON B.b_id=C.b_id WHERE B.val_b='3' and C.val_c='4') Set2 ON Set2.a_id = A.a_id
    INNER JOIN (SELECT C.a_id FROM B INNER JOIN C ON B.b_id=C.b_id WHERE B.val_b='5' and C.val_c='6') Set3 ON Set3.a_id = A.a_id
;

With 3 sets the query runs quite fast, but as soon as one adds another 3-4 sets, query performance degrades to ~30-40 seconds.
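One guess: the planner's join_collapse_limit / from_collapse_limit settings (default 8 in Postgres) may be kicking in around this join count. They can be inspected and raised per session like this (not verified on 8.2):

```sql
-- Guess: once the number of FROM/JOIN items exceeds these limits
-- (default 8), the planner stops searching join orders exhaustively,
-- which can cause a sudden drop in plan quality.
SHOW join_collapse_limit;
SHOW from_collapse_limit;

SET join_collapse_limit = 16;   -- session-local; increases planning effort
SET from_collapse_limit = 16;
-- ...then re-run the slow query and compare timings.
```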

user1039384
  • The values are actually strings. I just used numerical values in this example, but they can be any string up to 255 chars long. – user1039384 Feb 23 '15 at 23:32
  • 8.2? Really? 8.x is [no longer maintained](http://www.postgresql.org/support/versioning/) (=supported). You should upgrade to 9.4 as soon as possible. –  Feb 24 '15 at 07:12

4 Answers


It would be interesting to see if the following runs faster:

SELECT A.a_id
FROM A
WHERE
    A.a_id IN
    (
        SELECT C.a_id
        FROM B INNER JOIN C ON B.b_id=C.b_id
        WHERE B.val_b='1' and C.val_c='2'

        INTERSECT

        SELECT C.a_id
        FROM B INNER JOIN C ON B.b_id=C.b_id
        WHERE B.val_b='3' and C.val_c='4'

        INTERSECT

        SELECT C.a_id
        FROM B INNER JOIN C ON B.b_id=C.b_id
        WHERE B.val_b='5' and C.val_c='6'
    )
;

Effectively, instead of multiple IN clauses, this is an explicit intersection of multiple subsets.

My original answer had a query that was not equivalent to the original query in the question.

Here is an SQL Fiddle with some sample data and the original query, to check that my variant produces the same results as the original query.

edit

One more path to investigate. If each of the sub-queries runs quickly, but INTERSECT repeated many times in one long query becomes very slow, then you can try populating a temporary table with the results of the sub-queries and then using this temporary table with the main table A. Effectively, implement INTERSECT manually, one set at a time, using an explicit temporary table. Depending on the number of rows returned by the sub-queries, it may be beneficial to add an index to the temporary table.
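A sketch of that idea, using the table and column names from the question (untested; the exact steps would depend on your data):

```sql
-- Materialize the first set, then narrow it one pair at a time.
CREATE TEMP TABLE matched_ids AS
SELECT C.a_id
FROM B INNER JOIN C ON B.b_id = C.b_id
WHERE B.val_b = '1' AND C.val_c = '2';

DELETE FROM matched_ids
WHERE a_id NOT IN (SELECT C.a_id
                   FROM B INNER JOIN C ON B.b_id = C.b_id
                   WHERE B.val_b = '3' AND C.val_c = '4');

-- Repeat the DELETE for each remaining (val_b, val_c) pair, then:
SELECT A.a_id FROM A WHERE A.a_id IN (SELECT a_id FROM matched_ids);
```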

update

As for your question why Postgres performance degrades as the query becomes more complex... Your version of Postgres is rather old, and it is unlikely that anybody would be interested enough to investigate it in detail. I can offer only some generic thoughts. The latest version would most likely perform differently; there have been a lot of changes since 8.2.

In every RDBMS the query optimizer has limited resources and time to analyze the query, so it uses a lot of heuristics. As the number of joins in the query increases, the complexity of finding the optimal execution plan grows exponentially, so there must be a threshold after which the optimizer gives up and picks whatever plan it has.

You should be able to observe this. Examine the execution plan of the fast query, add another join to make the query slow, and compare the plans. Most likely the plans will be very different, and you should be able to determine which paths the optimizer chooses in each case.

It could be that, when given a query with few joins, the optimizer is able to transform it into a variant equivalent to using INTERSECT, but with a large number of joins it can't do that any more and just follows the query flow, doing join after join. It may even do this so inefficiently that it ends up running a loop inside a loop inside a loop...; in other words, the complexity jumps from linear to quadratic or worse.

So, really, the only answer to such performance questions is: examine the execution plan.
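For example (generic Postgres usage, with the question's schema; run it for both the fast 2-pair and slow 3-pair variants and compare the output):

```sql
-- Look for a join strategy change between the two plans, e.g. a hash
-- join being replaced by a nested loop, or a jump in estimated rows.
EXPLAIN ANALYZE
SELECT A.a_id
FROM A
WHERE A.a_id IN (SELECT C.a_id FROM B, C
                 WHERE B.b_id = C.b_id AND B.val_b = '1' AND C.val_c = '2')
  AND A.a_id IN (SELECT C.a_id FROM B, C
                 WHERE B.b_id = C.b_id AND B.val_b = '3' AND C.val_c = '4');
```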

BTW, the latest versions of Postgres have WITH (common table expressions, available since 8.4), which effectively creates a temporary table with intermediate results. It should help greatly in your case, because each of your subqueries is simple; if the system runs all of them separately first, it is then easy to combine the results.
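For illustration, a sketch of the first two sets expressed with WITH (table and column names taken from the question; requires Postgres 8.4 or later, so not available on 8.2):

```sql
-- Each CTE computes one (val_b, val_c) pair's a_id set up front;
-- the outer query then intersects them via IN.
WITH set1 AS (SELECT C.a_id FROM B INNER JOIN C ON B.b_id = C.b_id
              WHERE B.val_b = '1' AND C.val_c = '2'),
     set2 AS (SELECT C.a_id FROM B INNER JOIN C ON B.b_id = C.b_id
              WHERE B.val_b = '3' AND C.val_c = '4')
SELECT A.a_id
FROM A
WHERE A.a_id IN (SELECT a_id FROM set1)
  AND A.a_id IN (SELECT a_id FROM set2);
```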

Vladimir Baranov
  • Vladimir, thank you for the feedback. Mine does not support WITH, but per your suggestion I inlined the selects. For 3 sets it certainly outperforms the original version many times over, but as soon as I tried 7-8 sets it again took some 4 minutes to execute the SQL. I wonder if I am now hitting the from_collapse_limit and join_collapse_limit Postgres default configuration values? – user1039384 Feb 24 '15 at 03:53
  • @user1039384, my first answer was not correct. It was not equivalent to your query. It would produce duplicate rows if `a_id` was duplicated in `C`. My second variant with `INTERSECT` should be correct. Please try it. – Vladimir Baranov Feb 24 '15 at 03:59
  • @user1039384, what did you eventually choose to do? If any of the answers provided here were useful, please upvote them and accept the answer that was most useful. – Vladimir Baranov Feb 28 '15 at 10:40
  • @user1039384, was my additional explanation of how the query complexity can affect the performance satisfactory? Did you have a chance to examine your execution plans and notice how they change as the complexity of the query increases? – Vladimir Baranov Mar 12 '15 at 22:54
select a_id
from
    a
    inner join
    c using (a_id)
    inner join
    b using (b_id)
group by a_id
having
    bool_or((val_b, val_c) = ('1', '2')) and
    bool_or((val_b, val_c) = ('3', '4')) and
    bool_or((val_b, val_c) = ('5', '6'))

http://www.postgresql.org/docs/8.2/static/functions-aggregate.html

Clodoaldo Neto

Each subquery has to hit the indexes again, which increases the overhead of the query several times over. If I understand what you're asking, this is a case for the OR operator:

select a.a_id
from A
  join c on a.a_id = c.a_id
  join b on b.b_id = c.b_id
where 
(
  (b.val_b = '1' and c.val_c = '2')
  or (b.val_b = '3' and c.val_c = '4')
  or (b.val_b = '5' and c.val_c = '6')
)

This will give you all A records that are linked to a C record where the b and c values form one of the pairs you mentioned. Hope this helps :)

edit Seems it's a many-to-one relationship:

Select a.a_id
    , sum(case when b.val_b = '1' and c.val_c = '2' then 1 else 0 end) as Condition1
    , Sum(case when b.val_b = '3' and c.val_c = '4' then 1 else 0 end) as Condition2
    , Sum(case when b.val_b = '5' and c.val_c = '6' then 1 else 0 end) as Condition3
from A
  join c on a.a_id = c.a_id
  join b on b.b_id = c.b_id
group by a.a_id
having sum(case when b.val_b = '1' and c.val_c = '2' then 1 else 0 end) > 0
    and Sum(case when b.val_b = '3' and c.val_c = '4' then 1 else 0 end) > 0
    and Sum(case when b.val_b = '5' and c.val_c = '6' then 1 else 0 end) > 0

Hopefully that gets you there,

Randall
  • even shorter: `where (b.val_b, c.val_c) in ( ('1', '2'), ('3', '4'), ('5', '6'))` –  Feb 23 '15 at 22:54
  • I actually need only the a_ids that meet ALL criteria. Your proposed solution will find records matching ANY of the criteria. – user1039384 Feb 23 '15 at 23:34
  • Must be a many-to-one relationship then, so we will have to get a bit more clever. Let me edit in a many-to-one solution since I'm restricted on characters. – Randall Feb 24 '15 at 01:31
  1. Upgrade to a recent version
  2. Use the JOIN syntax for clarity
  3. Use EXISTS(...) instead of IN(...) for speed and comfort
  4. PK/FK constraints and indexes do help!

SELECT A.a_id
FROM A
WHERE EXISTS (
        SELECT *
        FROM B
        JOIN C ON B.b_id = C.b_id AND B.val_b = '1' 
        WHERE C.a_id = A.a_id AND C.val_c = '2'
        )
AND EXISTS (
        SELECT *
        FROM B
        JOIN C ON B.b_id = C.b_id AND B.val_b = '3' 
        WHERE C.a_id = A.a_id AND C.val_c = '4'
        )
AND EXISTS (
        SELECT *
        FROM B
        JOIN C ON B.b_id = C.b_id AND B.val_b = '5' 
        WHERE C.a_id = A.a_id AND C.val_c = '6'
        )
        ;
joop
  • I benchmarked this version, but it produced similarly poor performance to the original query. EXISTS didn't help much in this case, unfortunately. Trying to understand where that sudden performance drop is originating. Maybe a Postgres configuration issue?? – user1039384 Mar 03 '15 at 19:14
  • Your poor performance is due to the lack of indexes. You need indexes *at least* for the key / join fields. – joop Mar 05 '15 at 10:57