I have 2 tables. Their structure is roughly as follows; I've changed the names, though.
CREATE TABLE overlay_polygon
(
  overlay_polygon_id SERIAL PRIMARY KEY,
  some_other_polygon_id INTEGER REFERENCES some_other_polygon (some_other_polygon_id),
  dollar_value NUMERIC,
  geom GEOMETRY(Polygon, 26915)
);
CREATE TABLE point
(
  point_id SERIAL PRIMARY KEY,
  some_other_polygon_id INTEGER REFERENCES some_other_polygon (some_other_polygon_id),
  -- A bunch of other fields that this query won't touch
  geom GEOMETRY(Point, 26915)
);
point has a spatial index on its geom column, named spix_point, and an index on its some_other_polygon_id column as well.
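Roughly, those index definitions look like this (only spix_point is the real name; the name of the second index is made up here):

-- Spatial GiST index named spix_point, as described above.
CREATE INDEX spix_point ON point USING GIST (geom);

-- B-tree index on the foreign key column; this name is hypothetical.
CREATE INDEX point_some_other_polygon_id_idx ON point (some_other_polygon_id);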
There are approximately 500,000 rows in point, and almost all of them intersect some row in overlay_polygon. Originally, my overlay_polygon table contained several rows with very small areas (mostly less than 1 square meter) that did not spatially intersect any rows from point. After deleting those small, non-intersecting rows, there are 38 rows.
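For context, the cleanup was essentially an anti-join delete along these lines (a sketch, not the exact statement I ran):

-- Sketch: remove overlay polygons that intersect no points.
-- (The "very small area" condition is omitted from this sketch.)
DELETE FROM overlay_polygon op
WHERE NOT EXISTS (
    SELECT 1
    FROM point p
    WHERE ST_Intersects(op.geom, p.geom)
);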
As the name implies, overlay_polygon is a table of polygons generated as the result of overlaying polygons from 3 other tables (including some_other_polygon). In particular, I need to make some computations using dollar_value and some columns on point. When I set out to delete the rows that intersected no points, to speed up future processing, I ended up querying on the COUNT of rows. The most obvious query seemed to be the following.
SELECT op.*, COUNT(point_id) AS num_points
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id AND ST_Intersects(op.geom, point.geom)
GROUP BY op.overlay_polygon_id
ORDER BY op.overlay_polygon_id
;
Here is its EXPLAIN (ANALYZE, BUFFERS) output.
GroupAggregate  (cost=544.45..545.12 rows=38 width=8049) (actual time=284962.944..540959.914 rows=38 loops=1)
  Buffers: shared hit=58694 read=17119, temp read=189483 written=189483
  I/O Timings: read=39171.525
  ->  Sort  (cost=544.45..544.55 rows=38 width=8049) (actual time=271754.952..534154.573 rows=415224 loops=1)
        Sort Key: op.overlay_polygon_id
        Sort Method: external merge  Disk: 897016kB
        Buffers: shared hit=58694 read=17119, temp read=189483 written=189483
        I/O Timings: read=39171.525
        ->  Nested Loop Left Join  (cost=0.00..543.46 rows=38 width=8049) (actual time=0.110..46755.284 rows=415224 loops=1)
              Buffers: shared hit=58694 read=17119
              I/O Timings: read=39171.525
              ->  Seq Scan on overlay_polygon op  (cost=0.00..11.38 rows=38 width=8045) (actual time=0.043..153.255 rows=38 loops=1)
                    Buffers: shared hit=1 read=10
                    I/O Timings: read=152.866
              ->  Index Scan using spix_point on point  (cost=0.00..13.99 rows=1 width=200) (actual time=50.229..1139.868 rows=10927 loops=38)
                    Index Cond: (op.geom && geom)
                    Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                    Rows Removed by Filter: 13353
                    Buffers: shared hit=58693 read=17109
                    I/O Timings: read=39018.660
Total runtime: 542172.156 ms
However, I found that this query ran much, much more quickly:
SELECT *
FROM overlay_polygon
JOIN (SELECT op.overlay_polygon_id, COUNT(point_id) AS num_points
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id AND ST_Intersects(op.geom, point.geom)
GROUP BY op.overlay_polygon_id
) x ON x.overlay_polygon_id = overlay_polygon.overlay_polygon_id
ORDER BY overlay_polygon.overlay_polygon_id
;
Its EXPLAIN (ANALYZE, BUFFERS) output is below.
Sort  (cost=557.78..557.88 rows=38 width=8057) (actual time=18904.661..18904.748 rows=38 loops=1)
  Sort Key: overlay_polygon.overlay_polygon_id
  Sort Method: quicksort  Memory: 126kB
  Buffers: shared hit=58690 read=17134
  I/O Timings: read=9924.328
  ->  Hash Join  (cost=544.88..556.78 rows=38 width=8057) (actual time=18903.697..18904.210 rows=38 loops=1)
        Hash Cond: (overlay_polygon.overlay_polygon_id = op.overlay_polygon_id)
        Buffers: shared hit=58690 read=17134
        I/O Timings: read=9924.328
        ->  Seq Scan on overlay_polygon  (cost=0.00..11.38 rows=38 width=8045) (actual time=0.127..0.411 rows=38 loops=1)
              Buffers: shared hit=2 read=9
              I/O Timings: read=0.173
        ->  Hash  (cost=544.41..544.41 rows=38 width=12) (actual time=18903.500..18903.500 rows=38 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 2kB
              Buffers: shared hit=58688 read=17125
              I/O Timings: read=9924.154
              ->  HashAggregate  (cost=543.65..544.03 rows=38 width=8) (actual time=18903.276..18903.379 rows=38 loops=1)
                    Buffers: shared hit=58688 read=17125
                    I/O Timings: read=9924.154
                    ->  Nested Loop Left Join  (cost=0.00..543.46 rows=38 width=8) (actual time=0.052..17169.606 rows=415224 loops=1)
                          Buffers: shared hit=58688 read=17125
                          I/O Timings: read=9924.154
                          ->  Seq Scan on overlay_polygon op  (cost=0.00..11.38 rows=38 width=8038) (actual time=0.004..0.537 rows=38 loops=1)
                                Buffers: shared hit=1 read=10
                                I/O Timings: read=0.279
                          ->  Index Scan using spix_point on point  (cost=0.00..13.99 rows=1 width=200) (actual time=4.422..381.991 rows=10927 loops=38)
                                Index Cond: (op.geom && geom)
                                Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                                Rows Removed by Filter: 13353
                                Buffers: shared hit=58687 read=17115
                                I/O Timings: read=9923.875
Total runtime: 18905.293 ms
As you can see, they have comparable cost estimates, although I'm not sure how accurate those estimates are; I am dubious about cost estimates that involve PostGIS functions. Both tables have had VACUUM FULL ANALYZE run on them since they were last modified and before running the queries.
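The maintenance step was along these lines:

-- Rewrite both tables and refresh planner statistics.
VACUUM FULL ANALYZE overlay_polygon;
VACUUM FULL ANALYZE point;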
Perhaps I am simply unable to read my EXPLAIN ANALYZE output, but I cannot see why these queries have such drastically different run times. Can anyone identify anything? The only possibility I can think of is related to the number of columns involved in the LEFT JOIN.
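If the row width is the culprit, a narrow variant of the first query (essentially the inner subquery of the second, run on its own; a sketch for testing, not something I've benchmarked) should avoid the huge on-disk sort:

-- Narrow variant: aggregate only the key, so the sort/group step
-- never has to carry the wide geom and dollar_value columns.
SELECT op.overlay_polygon_id, COUNT(point_id) AS num_points
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id
               AND ST_Intersects(op.geom, point.geom)
GROUP BY op.overlay_polygon_id
ORDER BY op.overlay_polygon_id;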
EDIT 1
Per the suggestion of @ChrisTravers, I increased work_mem and reran the first query. I don't believe this represents a significant improvement.
I executed
SET work_mem = '4MB';
(it was previously 1 MB). Then executing the first query gave these results.
GroupAggregate  (cost=544.45..545.12 rows=38 width=8049) (actual time=339910.046..495775.478 rows=38 loops=1)
  Buffers: shared hit=58552 read=17261, temp read=112133 written=112133
  ->  Sort  (cost=544.45..544.55 rows=38 width=8049) (actual time=325391.923..491329.208 rows=415224 loops=1)
        Sort Key: op.overlay_polygon_id
        Sort Method: external merge  Disk: 896904kB
        Buffers: shared hit=58552 read=17261, temp read=112133 written=112133
        ->  Nested Loop Left Join  (cost=0.00..543.46 rows=38 width=8049) (actual time=14.698..234266.573 rows=415224 loops=1)
              Buffers: shared hit=58552 read=17261
              ->  Seq Scan on overlay_polygon op  (cost=0.00..11.38 rows=38 width=8045) (actual time=14.612..15.384 rows=38 loops=1)
                    Buffers: shared read=11
              ->  Index Scan using spix_point on point  (cost=0.00..13.99 rows=1 width=200) (actual time=95.262..5451.636 rows=10927 loops=38)
                    Index Cond: (op.geom && geom)
                    Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                    Rows Removed by Filter: 13353
                    Buffers: shared hit=58552 read=17250
Total runtime: 496936.775 ms
EDIT 2
Well, here's a nice, big smell that I didn't notice before (largely because I have trouble reading ANALYZE output). Sorry I didn't notice it sooner.
Sort (cost=544.45..544.55 rows=38 width=8049) (actual time=271754.952..534154.573 rows=415224 loops=1)
Estimated rows: 38. Actual rows: over 400K. Ideas, anyone?
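For what it's worth, the misestimate appears at the join itself, before any aggregation: in both plans above, the Nested Loop Left Join node is also estimated at rows=38. A diagnostic along these lines (hypothetical, not from my original testing) should reproduce it on the bare join:

-- Does the planner already misestimate the bare join, with no GROUP BY?
EXPLAIN
SELECT op.overlay_polygon_id, point.point_id
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id
               AND ST_Intersects(op.geom, point.geom);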