I have 2 tables. Their structure is roughly as follows; I've changed the names, though.
CREATE TABLE overlay_polygon
(
  overlay_polygon_id SERIAL PRIMARY KEY,
  some_other_polygon_id INTEGER REFERENCES some_other_polygon (some_other_polygon_id),
  dollar_value NUMERIC,
  geom GEOMETRY(Polygon, 26915)
);
CREATE TABLE point
(
  point_id SERIAL PRIMARY KEY,
  some_other_polygon_id INTEGER REFERENCES some_other_polygon (some_other_polygon_id),
  -- A bunch of other fields that this query won't touch
  geom GEOMETRY(Point, 26915)
);
point has a spatial index on its geom column, named spix_point, and an index on its some_other_polygon_id column as well.
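Roughly, those index definitions look like this (only spix_point is the real name; the name of the second index is made up here):

-- Spatial GiST index named spix_point, as described above.
CREATE INDEX spix_point ON point USING GIST (geom);

-- B-tree index on the foreign key column; this name is hypothetical.
CREATE INDEX point_some_other_polygon_id_idx ON point (some_other_polygon_id);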
There are approximately 500,000 rows in point, and almost all of them intersect some row in overlay_polygon. Originally, my overlay_polygon table contained several rows with very small areas (mostly less than 1 square meter) that did not spatially intersect any rows from point. After deleting those small, non-intersecting rows, there are 38 rows.
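For context, the cleanup was essentially an anti-join delete along these lines (a sketch, not the exact statement I ran):

-- Sketch: remove overlay polygons that intersect no points.
-- (The "very small area" condition is omitted from this sketch.)
DELETE FROM overlay_polygon op
WHERE NOT EXISTS (
    SELECT 1
    FROM point p
    WHERE ST_Intersects(op.geom, p.geom)
);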
As the name implies, overlay_polygon is a table of polygons generated as the result of overlaying polygons from 3 other tables (including some_other_polygon). In particular, I need to make some computations using dollar_value and some columns on point. When I set out to delete the rows that intersected no points, to speed up future processing, I ended up querying on the COUNT of rows. The most obvious query seemed to be the following.
SELECT op.*, COUNT(point_id) AS num_points
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id AND ST_Intersects(op.geom, point.geom)
GROUP BY op.overlay_polygon_id
ORDER BY op.overlay_polygon_id
;
Here is its EXPLAIN (ANALYZE, BUFFERS) output.
GroupAggregate  (cost=544.45..545.12 rows=38 width=8049) (actual time=284962.944..540959.914 rows=38 loops=1)
  Buffers: shared hit=58694 read=17119, temp read=189483 written=189483
  I/O Timings: read=39171.525
  ->  Sort  (cost=544.45..544.55 rows=38 width=8049) (actual time=271754.952..534154.573 rows=415224 loops=1)
        Sort Key: op.overlay_polygon_id
        Sort Method: external merge  Disk: 897016kB
        Buffers: shared hit=58694 read=17119, temp read=189483 written=189483
        I/O Timings: read=39171.525
        ->  Nested Loop Left Join  (cost=0.00..543.46 rows=38 width=8049) (actual time=0.110..46755.284 rows=415224 loops=1)
              Buffers: shared hit=58694 read=17119
              I/O Timings: read=39171.525
              ->  Seq Scan on overlay_polygon op  (cost=0.00..11.38 rows=38 width=8045) (actual time=0.043..153.255 rows=38 loops=1)
                    Buffers: shared hit=1 read=10
                    I/O Timings: read=152.866
              ->  Index Scan using spix_point on point  (cost=0.00..13.99 rows=1 width=200) (actual time=50.229..1139.868 rows=10927 loops=38)
                    Index Cond: (op.geom && geom)
                    Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                    Rows Removed by Filter: 13353
                    Buffers: shared hit=58693 read=17109
                    I/O Timings: read=39018.660
Total runtime: 542172.156 ms
However, I found that this query ran much, much more quickly:
SELECT *
FROM overlay_polygon
JOIN (SELECT op.overlay_polygon_id, COUNT(point_id) AS num_points
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id AND ST_Intersects(op.geom, point.geom)
GROUP BY op.overlay_polygon_id
) x ON x.overlay_polygon_id = overlay_polygon.overlay_polygon_id
ORDER BY overlay_polygon.overlay_polygon_id
;
Its EXPLAIN (ANALYZE, BUFFERS) output is below.
Sort  (cost=557.78..557.88 rows=38 width=8057) (actual time=18904.661..18904.748 rows=38 loops=1)
  Sort Key: overlay_polygon.overlay_polygon_id
  Sort Method: quicksort  Memory: 126kB
  Buffers: shared hit=58690 read=17134
  I/O Timings: read=9924.328
  ->  Hash Join  (cost=544.88..556.78 rows=38 width=8057) (actual time=18903.697..18904.210 rows=38 loops=1)
        Hash Cond: (overlay_polygon.overlay_polygon_id = op.overlay_polygon_id)
        Buffers: shared hit=58690 read=17134
        I/O Timings: read=9924.328
        ->  Seq Scan on overlay_polygon  (cost=0.00..11.38 rows=38 width=8045) (actual time=0.127..0.411 rows=38 loops=1)
              Buffers: shared hit=2 read=9
              I/O Timings: read=0.173
        ->  Hash  (cost=544.41..544.41 rows=38 width=12) (actual time=18903.500..18903.500 rows=38 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 2kB
              Buffers: shared hit=58688 read=17125
              I/O Timings: read=9924.154
              ->  HashAggregate  (cost=543.65..544.03 rows=38 width=8) (actual time=18903.276..18903.379 rows=38 loops=1)
                    Buffers: shared hit=58688 read=17125
                    I/O Timings: read=9924.154
                    ->  Nested Loop Left Join  (cost=0.00..543.46 rows=38 width=8) (actual time=0.052..17169.606 rows=415224 loops=1)
                          Buffers: shared hit=58688 read=17125
                          I/O Timings: read=9924.154
                          ->  Seq Scan on overlay_polygon op  (cost=0.00..11.38 rows=38 width=8038) (actual time=0.004..0.537 rows=38 loops=1)
                                Buffers: shared hit=1 read=10
                                I/O Timings: read=0.279
                          ->  Index Scan using spix_point on point  (cost=0.00..13.99 rows=1 width=200) (actual time=4.422..381.991 rows=10927 loops=38)
                                Index Cond: (op.geom && geom)
                                Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                                Rows Removed by Filter: 13353
                                Buffers: shared hit=58687 read=17115
                                I/O Timings: read=9923.875
Total runtime: 18905.293 ms
As you can see, they have comparable cost estimates, although I'm not sure how accurate those estimates are; I am dubious about cost estimates that involve PostGIS functions. Both tables have had VACUUM FULL ANALYZE run on them since they were last modified and before running the queries.
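The maintenance step was along these lines:

-- Rewrite both tables and refresh planner statistics.
VACUUM FULL ANALYZE overlay_polygon;
VACUUM FULL ANALYZE point;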
Perhaps I am simply unable to read my EXPLAIN ANALYZE output, but I cannot see why these queries have such drastically different run times. Can anyone identify anything? The only possibility I can think of is related to the number of columns involved in the LEFT JOIN.
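If the row width is the culprit, a narrow variant of the first query (essentially the inner subquery of the second, run on its own; a sketch for testing, not something I've benchmarked) should avoid the huge on-disk sort:

-- Narrow variant: aggregate only the key, so the sort/group step
-- never has to carry the wide geom and dollar_value columns.
SELECT op.overlay_polygon_id, COUNT(point_id) AS num_points
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id
               AND ST_Intersects(op.geom, point.geom)
GROUP BY op.overlay_polygon_id
ORDER BY op.overlay_polygon_id;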
EDIT 1
Per the suggestion of @ChrisTravers, I increased work_mem and reran the first query. I don't believe this represents a significant improvement.
I executed
SET work_mem = '4MB';
(it was previously 1 MB). Then executing the first query gave these results.
GroupAggregate  (cost=544.45..545.12 rows=38 width=8049) (actual time=339910.046..495775.478 rows=38 loops=1)
  Buffers: shared hit=58552 read=17261, temp read=112133 written=112133
  ->  Sort  (cost=544.45..544.55 rows=38 width=8049) (actual time=325391.923..491329.208 rows=415224 loops=1)
        Sort Key: op.overlay_polygon_id
        Sort Method: external merge  Disk: 896904kB
        Buffers: shared hit=58552 read=17261, temp read=112133 written=112133
        ->  Nested Loop Left Join  (cost=0.00..543.46 rows=38 width=8049) (actual time=14.698..234266.573 rows=415224 loops=1)
              Buffers: shared hit=58552 read=17261
              ->  Seq Scan on overlay_polygon op  (cost=0.00..11.38 rows=38 width=8045) (actual time=14.612..15.384 rows=38 loops=1)
                    Buffers: shared read=11
              ->  Index Scan using spix_point on point  (cost=0.00..13.99 rows=1 width=200) (actual time=95.262..5451.636 rows=10927 loops=38)
                    Index Cond: (op.geom && geom)
                    Filter: ((op.some_other_polygon_id = some_other_polygon_id) AND _st_intersects(op.geom, geom))
                    Rows Removed by Filter: 13353
                    Buffers: shared hit=58552 read=17250
Total runtime: 496936.775 ms
EDIT 2
Well, here's a nice, big smell that I didn't notice before (largely because I have trouble reading ANALYZE output). Sorry I didn't notice it sooner.
Sort (cost=544.45..544.55 rows=38 width=8049) (actual time=271754.952..534154.573 rows=415224 loops=1)
Estimated rows: 38. Actual rows: over 400K. Ideas, anyone?
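For what it's worth, the misestimate appears at the join itself, before any aggregation: in both plans above, the Nested Loop Left Join node is also estimated at rows=38. A diagnostic along these lines (hypothetical, not from my original testing) should reproduce it on the bare join:

-- Does the planner already misestimate the bare join, with no GROUP BY?
EXPLAIN
SELECT op.overlay_polygon_id, point.point_id
FROM overlay_polygon op
LEFT JOIN point ON op.some_other_polygon_id = point.some_other_polygon_id
               AND ST_Intersects(op.geom, point.geom);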