
I have PostgreSQL 9.5.9 and two tables, table1 and table2:

table1:

 Column   |              Type              |                 Modifiers                 
------------+--------------------------------+-------------------------------------------
 id         | integer                        | not null
 status     | integer                        | not null
 table2_id  | integer                        | 
 start_date | timestamp(0) without time zone | default NULL::timestamp without time zone
Indexes:
    "table1_pkey" PRIMARY KEY, btree (id)
    "table1_start_date" btree (start_date)
    "table1_table2" btree (table2_id)
Foreign-key constraints:
    "fk_t1_t2" FOREIGN KEY (table2_id) REFERENCES table2(id)


table2:

 Column |          Type           |            Modifiers            
--------+-------------------------+---------------------------------
 id     | integer                 | not null
 name   | character varying(2000) | default NULL::character varying
Indexes:
    "table2_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "table1" CONSTRAINT "fk_t1_t2" FOREIGN KEY (table2_id) REFERENCES table2(id)

table2 contains only 3 rows; table1 contains about 400000 rows, and only half of them have a value in the table2_id column.
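If you want to reproduce the setup, here is an approximation of the schema and data (the exact distribution is only approximated from the description above):

CREATE TABLE table2 (
    id   integer PRIMARY KEY,
    name varchar(2000) DEFAULT NULL
);

CREATE TABLE table1 (
    id         integer PRIMARY KEY,
    status     integer NOT NULL,
    table2_id  integer CONSTRAINT fk_t1_t2 REFERENCES table2 (id),
    start_date timestamp(0) DEFAULT NULL
);

CREATE INDEX table1_start_date ON table1 (start_date);
CREATE INDEX table1_table2 ON table1 (table2_id);

INSERT INTO table2 (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- ~400000 rows; every second row references one of the 3 table2 rows
INSERT INTO table1 (id, status, table2_id, start_date)
SELECT g,
       1,
       CASE WHEN g % 2 = 0 THEN 1 + g % 3 END,
       now() - g * interval '1 second'
FROM generate_series(1, 400000) AS g;

ANALYZE table1;
ANALYZE table2;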

The query is fast enough when I select some values from table1 ordered by the start_date column, because the table1_start_date index is used effectively:

SELECT t1.*
FROM table1 AS t1 
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;

EXPLAIN ANALYZE result:

 Limit  (cost=7797.40..7798.70 rows=25 width=20) (actual time=40.994..41.006 rows=25 loops=1)
   ->  Index Scan Backward using table1_start_date on table1 t1  (cost=0.42..20439.74 rows=393216 width=20) (actual time=0.078..36.251 rows=150025 loops=1)
 Planning time: 0.097 ms
 Execution time: 41.033 ms

But the query becomes very slow when I add a LEFT JOIN to fetch values from table2 as well:

SELECT t1.*, t2.*
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;

EXPLAIN ANALYZE result:

 Limit  (cost=33690.80..33696.42 rows=25 width=540) (actual time=191.282..191.320 rows=25 loops=1)
   ->  Nested Loop Left Join  (cost=0.57..88317.50 rows=393216 width=540) (actual time=0.028..184.537 rows=150025 loops=1)
         ->  Index Scan Backward using table1_start_date on table1 t1  (cost=0.42..20439.74 rows=393216 width=20) (actual time=0.018..35.196 rows=150025 loops=1)
         ->  Index Scan using table2_pkey on table2 t2  (cost=0.14..0.16 rows=1 width=520) (actual time=0.000..0.001 rows=1 loops=150025)
               Index Cond: (id = t1.table2_id)
 Planning time: 0.210 ms
 Execution time: 191.357 ms

Why did the query time increase from 41 ms to 191 ms? As I understand it, the LEFT JOIN cannot affect the result: each table1 row joins at most one table2 row, so we could select the 25 rows from table1 first (LIMIT 25) and then join the rows from table2, and the execution time of the query shouldn't increase significantly. There are no tricky conditions that could break the use of the index, etc.

I don't completely understand the EXPLAIN ANALYZE output for the second query, but it seems like the Postgres planner decided to "join and then limit" instead of "limit and then join". Done that way, the query is too slow. What is the problem?

  • _Columns_, not fields. – jarlh Nov 22 '17 at 10:33
  • Since it is a LEFT join and you're not selecting any columns of table2, PostgreSQL considers that you don't need table2 and simply ignores it. – Michel Milezzi Nov 22 '17 at 10:37
  • Thank you for the notice, I've edited the question. There is no join in the first query. – ox160d05d Nov 22 '17 at 10:40
  • If you are only interested in execution speed, you could try replacing the left join on t2 with a scalar subquery (sketched below, after these comments). – wildplasser Nov 22 '17 at 11:30
  • No, a subquery is the obvious solution for me in such cases. Actually I'm interested in understanding the PostgreSQL query planner. – ox160d05d Nov 22 '17 at 11:36
  • Maybe a better plan was generated, but not selected. There is an option to dump *all* generated plans to the logfile, before picking the *optimal* one. Also note that the query is relatively small, big-O estimation doesn't work well for small N. (the importance of the constants is too large) – wildplasser Nov 22 '17 at 11:47
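For reference, a minimal sketch of wildplasser's scalar-subquery suggestion, assuming name is the only column needed from table2 (a scalar subquery can return only a single column):

SELECT t1.*,
       (SELECT t2.name
          FROM table2 AS t2
         WHERE t2.id = t1.table2_id) AS name  -- NULL when there is no match, like the LEFT JOIN
FROM table1 AS t1
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;

Whether the planner evaluates the subquery before or after applying the LIMIT can vary, so it is worth checking the plan with EXPLAIN ANALYZE.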

1 Answer


It just doesn't know that the limit should be applied to table1 rather than to the result of the join, so it fetches the minimum required number of rows, which is 150025, and then does 150025 loops on table2. If you put the limit in a subselect on table1 and join table2 to that subselect, you should get what you want.

SELECT t1.*, t2.*
FROM (SELECT *
        FROM table1
       ORDER BY start_date DESC
       LIMIT 25 OFFSET 150000) AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id;
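Note that SQL does not guarantee that the join preserves the subquery's ordering, so if the order of the 25 returned rows matters, repeat the ORDER BY in the outer query:

SELECT t1.*, t2.*
FROM (SELECT *
        FROM table1
       ORDER BY start_date DESC
       LIMIT 25 OFFSET 150000) AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id
ORDER BY t1.start_date DESC;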
Łukasz Kamiński
  • Hm. For me it is clear enough that the limit should be applied to table1: 1) because a LEFT JOIN is used (not an INNER JOIN); 2) because we join on a foreign key from table1, so each table1 row can be joined with 0 or 1 rows from table2. So the join result will not affect the number of selected rows, will it? Isn't the planner clever enough? :) – ox160d05d Nov 22 '17 at 11:12
  • @ox160d05d OFFSET+LIMIT is applied to the *result* of the complete query. (that's the reason why it comes after ORDER BY and GROUP BY in the syntax) – wildplasser Nov 22 '17 at 11:15
  • @ox160d05d It isn't a one-to-one relation, so a JOIN (even a LEFT one) can change the number of rows in the result. Consider how many rows you would get if you flipped the tables, so that on the right side was `table2` with just a few rows, left-joined with `table1` with thousands of matching rows. Would it return just a few rows, or thousands? – Łukasz Kamiński Nov 22 '17 at 11:39
  • Thank you for the answer. The "table2 join table1" case is clear: we don't know how many rows will be joined, so we are forced to do the join first and then limit. But in the "table1 join table2" case we **know** that the LEFT JOIN will not affect the number of rows (table2_id is a foreign key referencing table2.id, which is the primary, unique key for table2). Actually it is a 1 to 0..1 relation. – ox160d05d Nov 22 '17 at 11:42
  • @ŁukaszKamiński in this particular case, only 0 or 1 rows from t2 will match a row from t1. So the rowcount for the result will be the same. – wildplasser Nov 22 '17 at 11:50
  • Yeah, you are right. I guess I don't have an answer to that then. I suppose it just isn't optimized enough. – Łukasz Kamiński Nov 22 '17 at 11:50