
When joining two tables via a composite (two-column) primary key, I get bad cardinality estimates in the query plan. Example:

CREATE TABLE t1 AS SELECT x, x*2 AS x2 FROM generate_series(0, 1000) AS x;
ALTER TABLE t1 ADD PRIMARY KEY(x, x2);
ANALYZE t1;

CREATE TABLE t2 AS SELECT x, x*2 AS x2 FROM generate_series(0, 1000) AS x;
ALTER TABLE t2 ADD FOREIGN KEY (x, x2) REFERENCES t1(x,x2);
ANALYZE t2;

EXPLAIN ANALYZE
SELECT *
FROM t1 JOIN t2 USING (x, x2);

 QUERY PLAN                                                                                                    
 ------------------------------------------------------------------------------------------------------------- 
 Hash Join  (cost=30.02..52.55 rows=1 width=8) (actual time=0.660..1.551 rows=1001 loops=1)                    
   Hash Cond: ((t1.x = t2.x) AND (t1.x2 = t2.x2))                                                              
   ->  Seq Scan on t1  (cost=0.00..15.01 rows=1001 width=8) (actual time=0.021..0.260 rows=1001 loops=1)       
   ->  Hash  (cost=15.01..15.01 rows=1001 width=8) (actual time=0.620..0.620 rows=1001 loops=1)                
         Buckets: 1024  Batches: 1  Memory Usage: 40kB                                                         
         ->  Seq Scan on t2  (cost=0.00..15.01 rows=1001 width=8) (actual time=0.019..0.230 rows=1001 loops=1) 
 Total runtime: 1.679 ms    

The plan expects one returned row, but 1001 rows are actually returned. This is harmless in a simple query like this one, but in complex queries such misestimates lead the planner to choose very slow plans. How can I help the query optimizer to do better?

Karl Bartel
  • The problem is that the primary key and foreign key are over-specified: in both tables x2 is fully functionally dependent on x. This confuses the optimiser; it cannot know about the implied dependence. Joining on only `t1.x = t2.x` gives the correct estimate (sketched just after these comments). Also `CREATE TABLE t1 AS SELECT x/100 AS x, x%100 AS x2 FROM generate_series(0, 10000) AS x;` (same for t2) gives the correct estimates. – joop Sep 26 '13 at 16:15
  • @joop The db could know that the join will not discard any rows, since the foreign key is used for the join. – Karl Bartel Sep 26 '13 at 16:17
  • I am aware of that. But it could be that the optimiser assumes that each key element adds entropy to the keyspace (probably a bad choice/order of heuristics). BTW: adding a primary key to t2 does not help. My div/mod trick does change the expected number of rows. – joop Sep 26 '13 at 16:32
  • If PostgreSQL allows it, you could use something else for your primary key and put a unique index on the combination of those two keys. – Dan Bracuk Sep 26 '13 at 16:54
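
A minimal sketch of the first suggestion from the comments: since x2 carries no information beyond x, joining on x alone returns the same rows, and the single equality clause is estimated at roughly 1001 * 1001 * (1/1001) = 1001 rows, matching the actual count.

EXPLAIN ANALYZE
SELECT *
FROM t1 JOIN t2 ON (t1.x = t2.x);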

2 Answers


Using a composite primary key in which one column is completely dependent on the other column is an "interesting" design.

In any case, PostgreSQL currently assumes that each column's selectivity is independent of the others and multiplies them together. It does this regardless of whether the columns are in the same index, even a primary key index, and I don't know of a good way around that.
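
You can see the inputs to that arithmetic in pg_stats (a minimal sketch; exact figures depend on your statistics settings):

SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename IN ('t1', 't2') AND attname IN ('x', 'x2');

Here ANALYZE reports every column as all-distinct (n_distinct = -1, i.e. 1001 distinct values), so the join estimate comes out to roughly 1001 * 1001 * (1/1001) * (1/1001) = 1 row, which matches the rows=1 in the plan above.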

You can use this circumlocution to get closer to the true selectivity, since the planner falls back to a default selectivity for the range clause instead of applying a second 1/ndistinct equality factor:

EXPLAIN ANALYZE
SELECT *
FROM t1 JOIN t2 on (t1.x=t2.x and t1.x2 between t2.x2 and t2.x2);
jjanes
  • The second column is not really meant to be dependent on the first column. I just could not think of a better way to create the test data. Suggestions welcome! – Karl Bartel Sep 26 '13 at 16:22

Another way is to create truly orthogonal key elements:

CREATE TABLE t1 AS SELECT x/100 AS x, x%100 AS x2 FROM generate_series(0, 10000) AS x;
ALTER TABLE t1 ADD PRIMARY KEY(x, x2);
ANALYZE t1;

CREATE TABLE t2 AS SELECT x/100 AS x, x%100 AS x2 FROM generate_series(0, 10000) AS x;
ALTER TABLE t2 ADD PRIMARY KEY (x, x2) ; -- added PK
ALTER TABLE t2 ADD FOREIGN KEY (x, x2) REFERENCES t1(x,x2);

ANALYZE t2;
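
With x and x2 now independent (about 101 and 100 distinct values respectively), the per-column selectivities multiply out to roughly the true join size, which you can verify (exact estimates will vary with your statistics):

EXPLAIN ANALYZE
SELECT *
FROM t1 JOIN t2 USING (x, x2);

The estimated row count should now be about 10001 * 10001 / (101 * 100) ≈ 9903, close to the actual 10001.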
joop