How to improve performance when combining two queries with 'IN' operator

Question

I'm using postgres 9.4

select version();
                                                    version                                                    
---------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.4.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit

I have a view in my database that has two columns, one integer and one text.

\d+ gff_attributes
+----------------+---------+-------------+-----------+---------------+
| Column         | Type    | Modifiers   | Storage   |   Description |
|----------------+---------+-------------+-----------+---------------|
| seqfeature_id  | integer |             | plain     |        <null> |
| gff_attributes | text    |             | extended  |        <null> |
+----------------+---------+-------------+-----------+---------------+
View definition:
 SELECT qv.seqfeature_id,
    string_agg((t.name::text || '='::text) || qv.value, ';'::text
      ORDER BY t.name) AS gff_attributes
   FROM term t,
    seqfeature_qualifier_value qv
  WHERE qv.term_id = t.term_id
  GROUP BY qv.seqfeature_id;

That combines data from two tables seqfeature_qualifier_value(~55,000,000 rows) and term (~11,000 rows)

\d+ seqfeature_qualifier_value
               Table "public.seqfeature_qualifier_value"
    Column     |  Type   |     Modifiers      | Storage  | Description 
---------------+---------+--------------------+----------+-------------
 seqfeature_id | integer | not null           | plain    | 
 term_id       | integer | not null           | plain    | 
 rank          | integer | not null default 0 | plain    | 
 value         | text    | not null           | extended | 
Indexes:
    "seqfeature_qualifier_value_pkey" PRIMARY KEY, btree (seqfeature_id, term_id, rank)
    "seqfeaturequal_sfid" btree (seqfeature_id)
    "seqfeaturequal_trm" btree (term_id)
    "seqfeaturequal_type_value" btree (term_id, value)
Foreign-key constraints:
    "fkseqfeature_featqual" FOREIGN KEY (seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE
    "fkterm_featqual" FOREIGN KEY (term_id) REFERENCES term(term_id)
Rules:
    rule_seqfeature_qualifier_value_i AS
    ON INSERT TO seqfeature_qualifier_value
   WHERE (( SELECT seqfeature_qualifier_value.seqfeature_id
           FROM seqfeature_qualifier_value
          WHERE seqfeature_qualifier_value.seqfeature_id = new.seqfeature_id AND seqfeature_qualifier_value.term_id = new.term_id AND seqfeature_qualifier_value.rank = new.rank)) IS NOT NULL DO INSTEAD NOTHING
Has OIDs: no

\d+ term
                                                Table "public.term"
   Column    |          Type          |                     Modifiers                     | Storage  | Description 
-------------+------------------------+---------------------------------------------------+----------+-------------
 term_id     | integer                | not null default nextval('term_pk_seq'::regclass) | plain    | 
 name        | character varying(255) | not null                                          | extended | 
 definition  | text                   |                                                   | extended | 
 identifier  | character varying(40)  |                                                   | extended | 
 is_obsolete | character(1)           |                                                   | extended | 
 ontology_id | integer                | not null                                          | plain    | 
Indexes:
    "term_pkey" PRIMARY KEY, btree (term_id)
    "term_identifier_key" UNIQUE, btree (identifier)
    "term_name_ontology_id_is_obsolete_key" UNIQUE, btree (name, ontology_id, is_obsolete)
    "term_ont" btree (ontology_id)
Foreign-key constraints:
    "fkont_term" FOREIGN KEY (ontology_id) REFERENCES ontology(ontology_id) ON DELETE CASCADE
Rules:
    rule_term_i1 AS
    ON INSERT TO term
   WHERE (( SELECT term.term_id
           FROM term
          WHERE term.identifier::text = new.identifier::text)) IS NOT NULL DO INSTEAD NOTHING
    rule_term_i2 AS
    ON INSERT TO term
   WHERE (( SELECT term.term_id
           FROM term
          WHERE term.name::text = new.name::text AND term.ontology_id = new.ontology_id AND term.is_obsolete = new.is_obsolete)) IS NOT NULL DO INSTEAD NOTHING
Has OIDs: no

Now if I want to select a subset of rows based on the seqfeature_id column I can get the results quite quickly using an explicit comparison:

    explain (analyze, verbose) select *
     from gff_attributes
     where seqfeature_id = 3596159;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                                                |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| GroupAggregate  (cost=337.27..734.68 rows=1 width=24) (actual time=11.690..11.690 rows=1 loops=1)                                                                         |
|   Output: qv.seqfeature_id, string_agg((((t.name)::text || '='::text) || qv.value), ';'::text ORDER BY t.name)                                                            |
|   Group Key: qv.seqfeature_id                                                                                                                                             |
|   ->  Hash Join  (cost=337.27..733.56 rows=110 width=24) (actual time=11.600..11.628 rows=6 loops=1)                                                                      |
|         Output: t.name, qv.seqfeature_id, qv.value                                                                                                                        |
|         Hash Cond: (qv.term_id = t.term_id)                                                                                                                               |
|         ->  Index Scan using seqfeaturequal_sfid on public.seqfeature_qualifier_value qv  (cost=0.56..394.66 rows=110 width=17) (actual time=0.036..0.055 rows=6 loops=1) |
|               Output: qv.seqfeature_id, qv.term_id, qv.rank, qv.value                                                                                                     |
|               Index Cond: (qv.seqfeature_id = 3596159)                                                                                                                    |
|         ->  Hash  (cost=194.09..194.09 rows=11409 width=15) (actual time=11.539..11.539 rows=11413 loops=1)                                                               |
|               Output: t.name, t.term_id                                                                                                                                   |
|               Buckets: 2048  Batches: 1  Memory Usage: 540kB                                                                                                              |
|               ->  Seq Scan on public.term t  (cost=0.00..194.09 rows=11409 width=15) (actual time=0.009..5.108 rows=11413 loops=1)                                        |
|                     Output: t.name, t.term_id                                                                                                                             |
| Planning time: 0.455 ms                                                                                                                                                   |
| Execution time: 11.753 ms                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

But when it's combined with a query that returns many seqfeature_id using the IN operator things slow down considerably (~2min)

    explain (analyse, verbose)
     select * from gff_attributes
     where seqfeature_id in (
         select seqfeature_id
         from seqfeature_qualifier_value
         where term_id = (select term_id
             from term
             where name = 'SRB_ortholog_id')
         and value = '1')
      ;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                                                    |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Merge Join  (cost=12911531.62..13619325.85 rows=251228 width=36) (actual time=121504.409..173449.696 rows=102 loops=1)                                                        |
|   Output: qv.seqfeature_id, (string_agg((((t.name)::text || '='::text) || qv.value), ';'::text ORDER BY t.name))                                                              |
|   Merge Cond: (qv.seqfeature_id = seqfeature_qualifier_value.seqfeature_id)                                                                                                   |
|   InitPlan 1 (returns $0)                                                                                                                                                     |
|     ->  Index Scan using term_name_ontology_id_is_obsolete_key on public.term  (cost=0.29..8.30 rows=1 width=4) (actual time=0.036..0.037 rows=1 loops=1)                     |
|           Output: term.term_id                                                                                                                                                |
|           Index Cond: ((term.name)::text = 'SRB_ortholog_id'::text)                                                                                                           |
|   ->  GroupAggregate  (cost=12905524.15..13607037.46 rows=502457 width=24) (actual time=121295.372..172418.928 rows=3687424 loops=1)                                          |
|         Output: qv.seqfeature_id, string_agg((((t.name)::text || '='::text) || qv.value), ';'::text ORDER BY t.name)                                                          |
|         Group Key: qv.seqfeature_id                                                                                                                                           |
|         ->  Sort  (cost=12905524.15..13044570.67 rows=55618608 width=24) (actual time=121295.315..132671.659 rows=22189814 loops=1)                                           |
|               Output: qv.seqfeature_id, t.name, qv.value                                                                                                                      |
|               Sort Key: qv.seqfeature_id                                                                                                                                      |
|               Sort Method: external merge  Disk: 1639072kB                                                                                                                    |
|               ->  Hash Join  (cost=336.70..2328594.94 rows=55618608 width=24) (actual time=13.358..41289.820 rows=55545757 loops=1)                                           |
|                     Output: qv.seqfeature_id, t.name, qv.value                                                                                                                |
|                     Hash Cond: (qv.term_id = t.term_id)                                                                                                                       |
|                     ->  Seq Scan on public.seqfeature_qualifier_value qv  (cost=0.00..1215886.08 rows=55618608 width=17) (actual time=0.063..12230.988 rows=55545757 loops=1) |
|                           Output: qv.seqfeature_id, qv.term_id, qv.rank, qv.value                                                                                             |
|                     ->  Hash  (cost=194.09..194.09 rows=11409 width=15) (actual time=13.278..13.278 rows=11413 loops=1)                                                       |
|                           Output: t.name, t.term_id                                                                                                                           |
|                           Buckets: 2048  Batches: 1  Memory Usage: 540kB                                                                                                      |
|                           ->  Seq Scan on public.term t  (cost=0.00..194.09 rows=11409 width=15) (actual time=0.011..6.207 rows=11413 loops=1)                                |
|                                 Output: t.name, t.term_id                                                                                                                     |
|   ->  Sort  (cost=5999.16..5999.20 rows=14 width=4) (actual time=0.404..0.436 rows=102 loops=1)                                                                               |
|         Output: seqfeature_qualifier_value.seqfeature_id                                                                                                                      |
|         Sort Key: seqfeature_qualifier_value.seqfeature_id                                                                                                                    |
|         Sort Method: quicksort  Memory: 29kB                                                                                                                                  |
|         ->  HashAggregate  (cost=5998.76..5998.90 rows=14 width=4) (actual time=0.345..0.368 rows=102 loops=1)                                                                |
|               Output: seqfeature_qualifier_value.seqfeature_id                                                                                                                |
|               Group Key: seqfeature_qualifier_value.seqfeature_id                                                                                                             |
|               ->  Bitmap Heap Scan on public.seqfeature_qualifier_value  (cost=88.22..5994.94 rows=1527 width=4) (actual time=0.102..0.290 rows=102 loops=1)                  |
|                     Output: seqfeature_qualifier_value.seqfeature_id, seqfeature_qualifier_value.term_id, seqfeature_qualifier_value.rank, seqfeature_qualifier_value.value   |
|                     Recheck Cond: ((seqfeature_qualifier_value.term_id = $0) AND (seqfeature_qualifier_value.value = '1'::text))                                              |
|                     Heap Blocks: exact=102                                                                                                                                    |
|                     ->  Bitmap Index Scan on seqfeaturequal_type_value  (cost=0.00..87.83 rows=1527 width=0) (actual time=0.083..0.083 rows=102 loops=1)                      |
|                           Index Cond: ((seqfeature_qualifier_value.term_id = $0) AND (seqfeature_qualifier_value.value = '1'::text))                                          |
| Planning time: 1.010 ms                                                                                                                                                       |
| Execution time: 173942.270 ms                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Note that when running the subquery by itself it is also fast (<1s) and returns 102 rows

    explain (analyse, verbose)
     select seqfeature_id
     from seqfeature_qualifier_value
     where term_id = (select term_id
         from term where name = 'SRB_ortholog_id'
         )
     and value = '1'
          ;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                                |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Bitmap Heap Scan on public.seqfeature_qualifier_value  (cost=96.52..6003.24 rows=1527 width=4) (actual time=0.104..0.319 rows=102 loops=1)                |
|   Output: seqfeature_qualifier_value.seqfeature_id                                                                                                        |
|   Recheck Cond: ((seqfeature_qualifier_value.term_id = $0) AND (seqfeature_qualifier_value.value = '1'::text))                                            |
|   Heap Blocks: exact=102                                                                                                                                  |
|   InitPlan 1 (returns $0)                                                                                                                                 |
|     ->  Index Scan using term_name_ontology_id_is_obsolete_key on public.term  (cost=0.29..8.30 rows=1 width=4) (actual time=0.035..0.037 rows=1 loops=1) |
|           Output: term.term_id                                                                                                                            |
|           Index Cond: ((term.name)::text = 'SRB_ortholog_id'::text)                                                                                       |
|   ->  Bitmap Index Scan on seqfeaturequal_type_value  (cost=0.00..87.83 rows=1527 width=0) (actual time=0.083..0.083 rows=102 loops=1)                    |
|         Index Cond: ((seqfeature_qualifier_value.term_id = $0) AND (seqfeature_qualifier_value.value = '1'::text))                                        |
| Planning time: 0.215 ms                                                                                                                                   |
| Execution time: 0.368 ms                                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+

I'm very confused as to why the IN operator adds so much time to the query? Is there a way that I can re-write this query to improve performance?

Please [edit] your question and add the output of `explain (analyze, verbose)` to your question instead of the simple `explain` output — , Mar 29 '16 at 22:10
You forgot to provide your version of Postgres - among other things. Please consider instructions in the [tag info for \[postgresql-performance\]](http://stackoverflow.com/tags/postgresql-performance/info). And we need exact table definitions (at least including relevant columns) and cardinalities for the underlying tables `term` and `seqfeature_qualifier_value`. — Erwin Brandstetter, Mar 30 '16 at 00:11
My apologies, I hadn't read the info page. I've updated my question to include the table schema, postgres version and approximate number of rows. — cts, Mar 30 '16 at 04:15
What about putting the sub-select in a CTE? Does that change anything. Something like: http://hastebin.com/ixiyiyurov.cs — , Mar 30 '16 at 06:11
no it doesn't improve things. However if I replace the call to the view with the query I used to make the view it runs in less than a second. — cts, Mar 30 '16 at 06:50
My first step would be to replace the two separate indexes `seqfeaturequal_sfid btree (seqfeature_id), seqfeaturequal_trm btree (term_id)` by a **composite index** on the two combined: `some_index_name(term_id, seqfeature_id)` Thi index can be UNIQUE (it is the primary key in reverser order) Such an index should IMHO be standard practice for *fat junction* tables. (maybe similar for the `value` column and combinations thereof) The joys of EAV... If the `values` column has low cardinality (a few thousands distinct values),you could also attempt to squeeze that out into a separate domain-table. — wildplasser, Mar 30 '16 at 10:11
BTW: what is `rank` ? This looks like a tabing-order field, and if it is, it should IMHO **not** be part of a primary key. (at most: part of another candidate key) — wildplasser, Mar 30 '16 at 11:50

score 0 · Answer 1 · answered Mar 29 '16 at 22:42

0

Something like this?

SELECT
t1.*
FROM
gff_attributes t1
INNER JOIN
(
    SELECT DISTINCT t3.seqfeature_id 
    FROM seqfeature_qualifier_value t3
    INNER JOIN term t4 on t3.term_id = t4.term_id AND t4.name = 'SRB_ortholog_id'
    WHERE
    t3.value = '1'
) t2 ON t1.seqfeature_id = t2.seqfeature_id

answered Mar 29 '16 at 22:42

Andy

56
4

sadly that way is no faster – cts Mar 29 '16 at 23:24
What type of database are you using? If you're using Oracle, add the optimizer hint /*+ MATERIALIZE */ to the sub-select portion of the query. – Andy Mar 29 '16 at 23:33
Strike that - from your syntax I gather you are using a PostgreSQL database. You might toy with creating a temporary table as described in this link for the sub-select section. http://stackoverflow.com/questions/15306199/materialize-common-table-expression-in-greenplum – Andy Mar 29 '16 at 23:41

score 0 · Answer 2 · answered Mar 30 '16 at 02:18

How about:

select a.* 
from
    gff_attributes as a
    join
    seqfeature_qualifier_value as b on
        a.seqfeature_id = b.seqfeature_id
        and
        b.value = '1'
    join
    term as c on
        b.term_id = c.term_id
        and
        c.name = 'SRB_ortholog_id';

score 0 · Answer 3 · answered Mar 30 '16 at 02:34

Generally, nested queries/subqueries are expensive, whether you use IN, or JOIN, or WHERE EXISTS. I've tried each in Transact-SQL and find that each has the exact same execution plan, so they are equal in terms of performance, at least where T-SQL is concerned.

A standard workaround is to pull your first query into a temporary table as Andy mentions, add an index to it (using ALTER TABLE), and run your subquery against the indexed temporary table. This will run faster in most flavors of SQL. If you want to dig deeper, google "performance issues with subqueries postgresql". You'll find lots of posts attempting to deal with the same problem.

wildplasser · Answer 4 · 2016-03-30T20:24:26.900

0

First: replace the IN + scalar subquery(yuk!) by an EXISTS clause(and add some aliases for mental sanity):

SELECT *
FROM gff_attributes ga
WHERE EXISTS ( SELECT 13
     FROM seqfeature_qualifier_value sqv
       JOIN term t ON t.term_id = sqv.term_id
     WHERE ga.seqfeature_id = sqv.seqfeature_id
     AND sqv.value = '1'
     AND t.name = 'SRB_ortholog_id'
 );

Next: In the fat junction table (or: value table), I would suggest replacing the two single-column indexes for term and feature by a single composite index. This is in fact the primary key in reversed order. (BTW is the rank field really needed to enforce uniqueness? what does it mean?)

DROP INDEX seqfeaturequal_sfid; -- (seqfeature_id)
DROP INDEX seqfeaturequal_trm;  -- (term_id)
    -- WHAT is "rank" ? Why is it needed?
CREATE UNIQUE INDEX seqfeaturequal_trm_sfid
  ON seqfeature_qualifier_value (term_id,seqfeature_id,rank);

And, of course you should alseo run ANALYZE seqfeature_qualifier_value; after adding the indexes, to refresh the statistics.

And: you should probably add a UNIQUE constraint on term.name; uou use it in a scalar subquery, so you are supposing it to be unique.

edited Mar 30 '16 at 20:24

answered Mar 30 '16 at 11:33

wildplasser

43,142
8
66
109

Hi, thanks for the tips about the indexes. Your suggested query doesn't seem to work as there is on 'value' column in the gff_attributes view. The columns that it does have are the integer for the `seqfeature_id` and text `gff_attributes` column which will contain a concatenation of many attributes. I could do something like, `like '%1%'` but I think that would be very slow. – cts Mar 30 '16 at 19:55
Thats why I added the comment about adding table aliases to your queries. In: `... and value = '1')` , I had to guess which RTE the value was supposed to come from. I probably guessed wrong... – wildplasser Mar 30 '16 at 20:15
Okay I fixed it. It was in the wrong query level. You still need to answer about the `rank` column. (I suspect this could be a table with 3 candidate keys) – wildplasser Mar 30 '16 at 20:24
the `rank` field is when multiple values with the same `term_id` are added to the same `seqfeature_id`. It will store the order in which they were added. – cts Mar 30 '16 at 20:55
Such that your `string_agg((t.name::text || '='::text) || qv.value, ';'::text` can result in multiple name=value pairs for the same value. (but unordered by rank) – wildplasser Mar 30 '16 at 21:16
yeah in this output I don't care about ordering the same name based on the rank value – cts Mar 30 '16 at 21:19

How to improve performance when combining two queries with 'IN' operator

4 Answers4