I have a weird problem with postgresql. Somewhy planner thinks that accessing data via foreign key index is very slow and uses sequental scan.
There's a simplified table structure:
create table flight
(
id bigint not null constraint flight_pkey primary key,
parent_id bigint not null constraint flight_parent_fkey references flight
);
create table passenger
(
id bigint not null constraint passenger_pkey primary key,
flight_id bigint not null constraint pax_flight_fkey references flight
);
create index pax_flight_id on passenger (flight_id);
Table passenger
has about 14 million rows and table flight has around 60 thousands.
The problem is, I have a QueryDSL query with lots of optional conditions, and one of them filters passengers by their flights.
QPassenger qPassenger = QPassenger.passenger;
Long flightId = 123456;
...
BooleanBuilder predicate = new BooleanBuilder();
predicate.and(qPassenger.flight.id.eq(flightId));
...
And when I try to fetch all the passengers that fit these optional conditions it generates query like one below, and it executes for the whole thirty seconds. That's terrifying. It somewhy uses sequental scan and hash join over all the passenger
table.
select
passenger0_.id as id1_13_,
passenger0_.flight_id as flight_14_13_
from
passenger passenger0_ cross join flights flight1_
where
passenger0_.flight_id = 123456
or flight1_.parent_id = 123456
However, after a day of solution-seeking I found that using primary key of the flight
table will make postgres use primary key index:
select
passenger0_.id as id1_13_,
passenger0_.flight_id as flight_14_13_
from
passenger passenger0_ cross join flights flight1_
where
flight1_.id = 123456 -- ←this line!
or flight1_.parent_id = 123456
Unfortunately, I cannot filter rows I received by hands, as it will result in ~13 millions of rows, opposite to ~300 passengers per flight when flightId
is set.
➥ So, my question is: is there any way to tell QueryDSL/Hibernate to use the specific column for this condition? I.e. flight.id
, not passenger.flight_id
.
Or, another question: what's wrong with my PostgreSQL planner and how can I fix it?
UPD Planner's plans:
- Good query that uses primary key in WHERE-condition:
EXPLAIN ANALYZE
SELECT *
FROM passenger p JOIN flights f ON p.flight_id = f.id
WHERE (f.id = 123456
OR f.parent_id = 123456);
QUERY PLAN
Nested Loop (cost=3.66..1759.26 rows=310 width=951) (actual time=0.044..0.242 rows=184 loops=1)
-> Bitmap Heap Scan on flights f (cost=3.10..9.78 rows=6 width=240) (actual time=0.024..0.029 rows=4 loops=1)
Recheck Cond: ((id = 123456) OR (parent_id = 123456))
Heap Blocks: exact=4
-> BitmapOr (cost=3.10..3.10 rows=6 width=0) (actual time=0.018..0.018 rows=0 loops=1)
-> Bitmap Index Scan on flights_pkey (cost=0.00..1.53 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (id = 123456)
-> Bitmap Index Scan on flt_parent_id_index (cost=0.00..1.56 rows=5 width=0) (actual time=0.009..0.009 rows=3 loops=1)
Index Cond: (parent_id = 123456)
-> Index Scan using passenger_flight_id on passenger p (cost=0.56..286.73 rows=485 width=711) (actual time=0.005..0.023 rows=46 loops=4)
Index Cond: (flight_id = f.id)
Planning Time: 0.566 ms
Execution Time: 0.321 ms
- Bad query that uses foreign key in WHERE-condition:
EXPLAIN ANALYZE
SELECT *
FROM passenger p JOIN flights f ON p.flight_id = f.id
WHERE (p.flight_id = 123456
OR f.parent_id = 123456);
QUERY PLAN
Gather (cost=34194.96..3461993.92 rows=734 width=951) (actual time=79878.815..80711.129 rows=184 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Join (cost=33194.96..3460920.52 rows=306 width=951) (actual time=71883.434..72044.345 rows=61 loops=3)
Hash Cond: (p.flight_id = f.id)
Join Filter: ((p.flight_id = 123456) OR (f.parent_id = 123456))
Rows Removed by Join Filter: 11206038
-> Parallel Seq Scan on passenger p (cost=0.00..827052.82 rows=14216282 width=711) (actual time=20.021..27298.757 rows=11206100 loops=3)
-> Parallel Hash (cost=20891.65..20891.65 rows=275065 width=240) (actual time=1284.916..1284.917 rows=219796 loops=3)
Buckets: 8192 Batches: 128 Memory Usage: 1248kB
-> Parallel Seq Scan on flights f (cost=0.00..20891.65 rows=275065 width=240) (actual time=2.134..966.560 rows=219796 loops=3)
Planning Time: 0.605 ms
Execution Time: 80711.774 ms