How to perform fast join on postgresql

Question

using postgres-9.4 on ubuntu-15.10

Performing join on two tables on indexed column and get Seq Scan

table a : ~7 million rows, contains char()&varchar()&float, have clustered b-tree index on date char(11). 1~2500 rows on each date.

table b : ~6k rows,contains 2 cols date char(11) and feature float[] also have clustered b-tree index on date char(11). There is only 1 record on each date.

I have the following query, cols in select are too many and too ugly so I make it short:

SELECT
  6 regexp_split_to_array(a.char_cols),
  ARRAY[ a.all_float] || b.feature
FROM
  a JOIN b ON
a.date = b.date;

It takes ~50000ms and use EXPLAIN ANALYZE VERBOSE I have following result:

QUERY PLAN
-----------------------------------------------------------------------------
 Hash Join  (cost=501.67..790703.48 rows=7154354 width=856) (actual time=3.216..49223.885 rows=7154589 loops=1)
   Output: A_LOT_MESS
   Hash Cond: (a.date = b.date)
   ->  Seq Scan on public.a  (cost=0.00..521913.54 rows=7154354 width=440) (actual time=0.001..967.771 rows=7154589 loops=1)
     Output: EVERY_COLS_OF_A
   ->  Hash  (cost=424.63..424.63 rows=6163 width=428) (actual time=3.157..3.157 rows=6163 loops=1)
     Output: b.feature, b.date
     Buckets: 1024  Batches: 1  Memory Usage: 2769kB
     ->  Seq Scan on public.b  (cost=0.00..424.63 rows=6163 width=428) (actual time=0.003..1.156 rows=6163 loops=1)
           Output: b.feature, b.date
 Planning time: 1.041 ms
 Execution time: 49396.419 ms
(12 rows)

I tried advices from PostgreSQL query runs faster with index scan, but engine chooses hash join ,

but neither set random_page_cost = 2 nor set work_mem = '2048MB' can have any effect: plan and times remains the same.

I see some article says that bitmap index scan could do better, but I don't know how to create such index: It seems that postgres is the one who decide if it should be generated when querying something.

Additional information:

about resource usage:

query takes only one logical core to work, and I have enough memory(32GiB, when database size <6GiB)

`char()&varchar()&float, have clustered b-tree index on date char(11)` is not a readable description of your tables. Please add the real DDL. — wildplasser, Mar 06 '16 at 15:31

score 4 · Accepted Answer · answered Mar 06 '16 at 15:57

No index will be used on this query because all 7M rows of public.a are being used by the query. Only when you make a more restrictive filter such that only a small fraction of the rows in public.a are necessary to run the query will you see that the index gets used.

If you look at the actual run times of the query you will note that the sequential scan take less than 1 second (967.771 ms, to be precise) while the join takes up about 48 seconds.

How to perform fast join on postgresql

1 Answers1