I have 19 years of Oracle and MySQL experience (DBA and dev) and I am new to Postgres, so I may be missing something obvious. But I cannot get this query to do what I want.

NOTE: This query is running on an EngineYard Postgres instance. I am not immediately aware of how its parameters are configured. Also, the applicable_type and status columns in the details table are of the extension type citext.

The following query can take in excess of 60 seconds to return rows:

SELECT items.item_id, 
       CASE WHEN items.sku IS NULL THEN items.title ELSE concat(items.title, ' (SKU: ', items.sku, ')') END title, 
       items.listing_status, items.updated_at, items.id, 
       items.sku, count(details.id) detail_count 
FROM "items" LEFT OUTER JOIN details ON details.applicable_id = items.id 
                                    and details.applicable_type = 'Item' 
                                    and details.status = 'Valid' 
                LEFT OUTER JOIN products ON products.id = items.product_id
WHERE "items"."user_id" = 3
GROUP BY items.id
ORDER BY title asc
LIMIT 25 OFFSET 0

The details table contains 6.5M rows. The LEFT OUTER JOIN to it does a sequential scan rather than using the index on applicable_id. Cardinality-wise, that column has about 120K distinct values across the 6.5M rows.

I have a btree index on details with the following columns:

applicable_id
applicable_type
status

but really, it is applicable_type and status that have low cardinality.
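
In DDL form, that index is roughly the following (the index name is assumed; the question only gives the column order):

-- Hypothetical name; the question only specifies the column order.
CREATE INDEX index_details_on_applicable
    ON details (applicable_id, applicable_type, status);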

My explain analyze looks like this:

Limit  (cost=247701.59..247701.65 rows=25 width=118) (actual time=28781.090..28781.098 rows=25 loops=1)
  ->  Sort  (cost=247701.59..247703.05 rows=585 width=118) (actual time=28781.087..28781.090 rows=25 loops=1)
      Sort Key: (CASE WHEN (items.sku IS NULL) THEN (items.title)::text ELSE pg_catalog.concat(items.title, ' (SKU: ', items.sku, ')') END)
      Sort Method: top-N heapsort  Memory: 30kB
      ->  HashAggregate  (cost=247677.77..247685.08 rows=585 width=118) (actual time=28779.658..28779.974 rows=664 loops=1)
          ->  Hash Right Join  (cost=2069.47..247645.64 rows=6425 width=118) (actual time=17798.898..28742.395 rows=60047 loops=1)
                Hash Cond: (details.applicable_id = items.id)
                ->  Seq Scan on details  (cost=0.00..220591.65 rows=6645404 width=8) (actual time=6.272..27702.717 rows=6646205 loops=1)
                      Filter: ((applicable_type = 'Listing'::citext) AND (status = 'Valid'::citext))
                      Rows Removed by Filter: 942
                ->  Hash  (cost=2062.16..2062.16 rows=585 width=118) (actual time=1.286..1.286 rows=664 loops=1)
                      Buckets: 1024  Batches: 1  Memory Usage: 90kB
                      ->  Bitmap Heap Scan on items  (cost=16.87..2062.16 rows=585 width=118) (actual time=0.157..0.748 rows=664 loops=1)
                            Recheck Cond: (user_id = 3)
                            ->  Bitmap Index Scan on index_items_on_user_id  (cost=0.00..16.73 rows=585 width=0) (actual time=0.141..0.141 rows=664 loops=1)
                                  Index Cond: (user_id = 3)

Total runtime: 28781.238 ms

RubyRedGrapefruit
  • Curious to know... what kind of plan were you expecting it to yield? Top-n sort of the big aggregate seems about right, no? (Also: shouldn't that be count(distinct details.id), seeing the left join on products?) – Denis de Bernardy Oct 17 '13 at 21:47
  • You can try to temporarily disable hash joins with `set enable_hashjoin = false;` to see a different query plan and compare it with the current one. – Ihor Romanchenko Oct 17 '13 at 22:16
  • Also, to get better performance you can apply `LIMIT` to `items` in a subquery before joining with the other tables (both ideas are sketched after these comments). – Ihor Romanchenko Oct 17 '13 at 22:18
  • BTW the plan you are showing is from a different query - it has table `apps` instead of `details` and does not join `products`. – Ihor Romanchenko Oct 17 '13 at 22:24
  • As mentioned above, the tables in the analysis are different from those in the query. Also, perhaps your `apps` table does not have `applicable_id`, `applicable_type`, and `status` indexed, because it should not be doing a seq scan on a join. – Lucas Oct 18 '13 at 11:02
  • I was just trying to obfuscate the table names and neglected to do it in the explain. I assure you that the analysis is from the query. Yes, there is an index that is applicable_id, applicable_type, status, in that order. It is being ignored. – RubyRedGrapefruit Oct 18 '13 at 13:42
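
A rough sketch of the two suggestions from the comments, assuming the table and column names from the question (the unused join to products is dropped for brevity):

-- Session-local experiment: steer the planner away from the hash join,
-- rerun EXPLAIN ANALYZE, and compare the two plans.
SET enable_hashjoin = false;
-- ... rerun the EXPLAIN ANALYZE of the query here ...
RESET enable_hashjoin;

-- Limit items before joining. The rewrite is equivalent only because the
-- sort key depends on items alone and the left join cannot drop item rows.
SELECT i.item_id,
       CASE WHEN i.sku IS NULL THEN i.title::text
            ELSE concat(i.title, ' (SKU: ', i.sku, ')') END AS title,
       i.listing_status, i.updated_at, i.id, i.sku,
       count(details.id) AS detail_count
FROM (SELECT *
      FROM items
      WHERE user_id = 3
      ORDER BY CASE WHEN sku IS NULL THEN title::text
                    ELSE concat(title, ' (SKU: ', sku, ')') END
      LIMIT 25 OFFSET 0) i
LEFT OUTER JOIN details ON details.applicable_id = i.id
                       AND details.applicable_type = 'Item'
                       AND details.status = 'Valid'
GROUP BY i.item_id, i.title, i.sku, i.listing_status, i.updated_at, i.id
ORDER BY title;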

2 Answers

Do you have an index on the expression that yields the title? Better yet, one on (user_id, title_expression).

If not, that might be an excellent thing to add: it would let Postgres nestloop through the first 25 rows of an index scan. As things stand, Postgres can't reasonably guess which 25 rows will be needed, hence the seq scan you're currently getting on the joined table.
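
A sketch of what such an expression index could look like. Two caveats: the index name is invented, and concat() is only STABLE in Postgres while index expressions must be IMMUTABLE, so the || operator is used instead (equivalent here, since the ELSE branch only runs when sku is not null):

-- Hypothetical name; || replaces concat() because index expressions
-- must be IMMUTABLE and concat() is only STABLE.
CREATE INDEX index_items_on_user_id_and_title_expr
    ON items (user_id,
              (CASE WHEN sku IS NULL THEN title::text
                    ELSE title::text || ' (SKU: ' || sku::text || ')' END));

For the planner to actually use it, the query's ORDER BY would have to be rewritten with the same || expression.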

Denis de Bernardy
  • I have a unique index with (user_id, sku, title, item_id, id). The seq scan is still showing up even without the limit. – RubyRedGrapefruit Oct 18 '13 at 14:07
  • @AKWF: Perhaps, but seeing your query, the index that you need to avoid the seq scan is on (user_id, CASE when sku is null then title else concat(title, ' (SKU: ', sku, ')') END)... Else, there's absolutely no way that Postgres can reasonably guess which are going to be the top 25 rows. Or then, order by items.title, items.sku (as in the raw data) and then add a more useful index on (user_id, title, sku). – Denis de Bernardy Oct 18 '13 at 14:47

I think you need an index on the applicable_id column only (without the applicable_type and status columns). You may also need to increase the default_statistics_target parameter (system-wide, or better, for the applicable_id column only) so that PostgreSQL makes a better estimate of the number of rows in the join.
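
In SQL, both suggestions might look like this (the index name and the target value of 1000 are illustrative; the default default_statistics_target is 100):

CREATE INDEX index_details_on_applicable_id ON details (applicable_id);

-- Raise the statistics target for this one column, then refresh the stats:
ALTER TABLE details ALTER COLUMN applicable_id SET STATISTICS 1000;
ANALYZE details;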

alexius