11

I have a DELETE query that I need to run on PostgreSQL 9.0.4. I am finding that it is performant until it hits 524,289 rows in a subselect query.

For instance, at 524,288 there is no materialized view used and the cost looks pretty good:

explain DELETE FROM table1 WHERE pointLevel = 0 AND userID NOT IN
(SELECT userID FROM table2 fetch first 524288 rows only);
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete  (cost=13549.49..17840.67 rows=21 width=6)
   ->  Index Scan using jslps_userid_nopt on table1  (cost=13549.49..17840.67 rows=21 width=6)
         Filter: ((NOT (hashed SubPlan 1)) AND (pointlevel = 0))
         SubPlan 1
           ->  Limit  (cost=0.00..12238.77 rows=524288 width=8)
                 ->  Seq Scan on table2  (cost=0.00..17677.92 rows=757292 width=8)
(6 rows)

However, as soon as I hit 524,289, the materialized view comes into play and the DELETE query becomes much more costly:

explain DELETE FROM table1 WHERE pointLevel = 0 AND userID NOT IN
(SELECT userID FROM table2 fetch first 524289 rows only);

  QUERY PLAN

-----------------------------------------------------------------------------------------------------------  
Delete  (cost=0.00..386910.33 rows=21 width=6)
    ->  Index Scan using jslps_userid_nopt on table1  (cost=0.00..386910.33 rows=21 width=6)
         Filter: ((pointlevel = 0) AND (NOT (SubPlan 1)))
         SubPlan 1
           ->  Materialize  (cost=0.00..16909.24 rows=524289 width=8)
                 ->  Limit  (cost=0.00..12238.79 rows=524289 width=8)
                       ->  Seq Scan on table2  (cost=0.00..17677.92 rows=757292 width=8) (7 rows)

I worked around the issue by using a JOIN in the sub-select query instead:

SELECT s.userid 
FROM table1 s 
LEFT JOIN table2 p ON s.userid=p.userid
WHERE p.userid IS NULL AND s.pointlevel=0

However, I am still interested in understanding why the materialize decreases performance so drastically.

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
cizer
  • 213
  • 2
  • 7

1 Answers1

9

My guess is that at rows=524289 the memory buffer is filled up, so the subquery has to be materialized on the disk. Hence the dramatic increase in the time needed.

Here you can read more about configuring the memory buffers: http://www.postgresql.org/docs/9.1/static/runtime-config-resource.html
If you play with work_mem you will see the difference in the query behavior.

However using join in the subquery is much better way to speed the query, since you are limiting the number of the rows at the source itself vs simply selecting first XYZ rows and then performing checks.

Antoan Milkov
  • 2,152
  • 17
  • 30