
I have a case where I need to clean up a table of orphaned records regularly, so I'm looking for a high-performance solution. I tried using an 'IN' clause, but it's not really fast. Both tables have all the required indexes on the columns involved (id - primary key, component_id - index, component_type - index).

DELETE FROM component_apportionment
WHERE id in (
        SELECT a.id
        FROM component_apportionment a
        LEFT JOIN component_live c
        ON (c.component_id = a.component_id
                AND
                c.component_type = a.component_type)
        WHERE c.id is null);

Basically, the case is to remove records from the 'component_apportionment' table which do not exist in the 'component_live' table.
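
For reference, a minimal sketch of the schema as described (only the column and index names come from the question; the types are assumptions, with varchar for component_type inferred from the ::text cast in the plan below):

create table component_live (
    id             serial primary key,
    component_id   bigint,       -- assumed type
    component_type varchar(50)   -- assumed type and length
);
create table component_apportionment (
    id             serial primary key,
    component_id   bigint,
    component_type varchar(50)
);
create index on component_live (component_id);
create index on component_live (component_type);
create index on component_apportionment (component_id);
create index on component_apportionment (component_type);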

The query plan for the query above is horrible as well:

Delete on component_apportionment_copy1   (cost=3860927.55..3860929.09 rows=1 width=18) (actual  time=183479.848..183479.848 rows=0 loops=1)
->  Nested Loop  (cost=3860927.55..3860929.09 rows=1 width=18) (actual time=183479.811..183479.813 rows=1 loops=1)
    ->  HashAggregate  (cost=3860927.12..3860927.13 rows=1 width=20) (actual time=183479.793..183479.793 rows=1 loops=1)
          Group Key: a.id
          ->  Merge Right Join  (cost=3753552.72..3860927.12 rows=1 width=20) (actual time=172941.125..183479.787 rows=1 loops=1)
                Merge Cond: ((c.component_id = a.component_id) AND ((c.component_type)::text = (a.component_type)::text))
                Filter: (c.id IS NULL)
                Rows Removed by Filter: 5968195
                ->  Sort  (cost=3390767.32..3413658.29 rows=9156391 width=21) (actual time=169852.438..172642.897 rows=8043013 loops=1)
                      Sort Key: c.component_id, c.component_type
                      Sort Method: external merge  Disk: 310232kB
                      ->  Seq Scan on component_live c  (cost=0.00..2117393.91 rows=9156391 width=21) (actual time=0.004..155656.568 rows=9333382 loops=1)
                ->  Materialize  (cost=362785.40..375049.75 rows=2452871 width=21) (actual time=3088.653..5343.013 rows=5968195 loops=1)
                      ->  Sort  (cost=362785.40..368917.58 rows=2452871 width=21) (actual time=3088.648..3989.163 rows=2452871 loops=1)
                            Sort Key: a.component_id, a.component_type
                            Sort Method: external merge  Disk: 81504kB
                            ->  Seq Scan on component_apportionment_copy1 a  (cost=0.00..44969.71 rows=2452871 width=21) (actual time=0.920..882.040 rows=2452871 loops=1)
    ->  Index Scan using component_apportionment_copy1_pkey on component_apportionment_copy1  (cost=0.43..1.95 rows=1 width=14) (actual time=0.012..0.012 rows=1 loops=1)
          Index Cond: (id = a.id)
Planning time: 5.573 ms
Execution time: 183554.675 ms

Would appreciate any help. Thanks

Note

The tables have approx. 80 million records each in the worst case. Both tables have indexes on the columns used.

UPDATE

Query plan for 'not exists'

Query:

EXPLAIN (analyze, verbose, buffers) DELETE FROM component_apportionment_copy1
WHERE not exists (select 1
    from component_live c
    where c.component_id = component_apportionment_copy1.component_id);


Delete on vector.component_apportionment_copy1  (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=203643.560..203643.560 rows=0 loops=1)
Buffers: shared hit=20875 read=2025400, temp read=46067 written=45813
->  Hash Anti Join  (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=202212.975..203643.486 rows=1 loops=1)
    Output: component_apportionment_copy1.ctid, c.ctid
    Hash Cond: (component_apportionment_copy1.component_id = c.component_id)
    Buffers: shared hit=20874 read=2025400, temp read=46067 written=45813
    ->  Seq Scan on vector.component_apportionment_copy1  (cost=0.00..44969.71 rows=2452871 width=10) (actual time=0.003..659.668 rows=2452871 loops=1)
          Output: component_apportionment_copy1.ctid, component_apportionment_copy1.component_id
          Buffers: shared hit=20441
    ->  Hash  (cost=2117393.91..2117393.91 rows=9156391 width=10) (actual time=198536.786..198536.786 rows=9333382 loops=1)
          Output: c.ctid, c.component_id
          Buckets: 16384  Batches: 128  Memory Usage: 3195kB
          Buffers: shared hit=430 read=2025400, temp written=36115
          ->  Seq Scan on vector.component_live c  (cost=0.00..2117393.91 rows=9156391 width=10) (actual time=0.039..194415.641 rows=9333382 loops=1)
                Output: c.ctid, c.component_id
                Buffers: shared hit=430 read=2025400
Planning time: 6.639 ms
Execution time: 203643.594 ms

It's doing a seq scan on both tables, and the more data there is, the slower it will get.

  • Run `ANALYZE table_name` to update stats – Mihai Nov 24 '15 at 17:36
  • Forcing VACUUM and ANALYZE after those queries, so the stats are good. Thanks, though. – admix Nov 24 '15 at 17:38
  • Are you sure you have an index on `component_live(component_id)`? I am a bit surprised that Postgres is doing a seq scan on `component_live` instead of using the index. You could try `set enable_seqscan = false` and see if that forces Postgres to use the index. If it does, and that is indeed faster, you might want to think about adjusting `random_page_cost`. What is the exact version of Postgres you are running? – Nov 24 '15 at 19:20
  • It is faster with seq scan off, so I will use that solution and might play with the configs a little as well. Execution time: 60s. Thanks a lot for your help. PS: I will mark your answer as accepted; however, can you please update it with the seq scan option? – admix Nov 24 '15 at 19:52
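
A session-local way to try what these comments suggest, without committing anything (EXPLAIN ANALYZE actually executes the DELETE, so the ROLLBACK discards it; the random_page_cost value of 1.1 is an assumption suitable for SSD-class storage, not something from the thread):

begin;
set local enable_seqscan = off;     -- discourage seq scans for this transaction only
set local random_page_cost = 1.1;   -- assumption: fast random I/O (e.g. SSD)
explain (analyze, buffers)
delete from component_apportionment ca
where not exists (select 1
                  from component_live c
                  where c.component_id = ca.component_id);
rollback;                           -- discard the test run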

2 Answers


You have way too many joins in there:

set enable_seqscan = false; -- force the planner to prefer index scans

DELETE FROM component_apportionment
WHERE not exists (select 1
                  from component_live c
                  where c.component_id = component_apportionment.component_id);

This will do the same thing and should be much faster, especially if you have indexes on the component_id columns.
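
One caveat worth adding (my note, not part of the original answer): `set enable_seqscan = false` stays in effect for the whole session. A sketch of scoping it to just the statement that needs it:

begin;
set local enable_seqscan = off;  -- reverts automatically at commit/rollback
delete from component_apportionment
where not exists (select 1
                  from component_live c
                  where c.component_id = component_apportionment.component_id);
commit;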

  • Thanks, and yes, I do have indexes on all the used columns. Will try that one. – admix Nov 24 '15 at 17:20
  • The query did work; however, it is even slower than the one I'm currently using. -- Execution time: 260536.836 ms – admix Nov 24 '15 at 17:32
  • @admix: Probably the problem is not the performance of "finding" which records have to be deleted (the exists way or the in way), but the deletion itself. How many records do you estimate will be deleted? Does the table have any trigger on a "delete" event? Does the table have foreign key references? – Christian Nov 24 '15 at 18:01
  • To check whether the problem is (or is not) in the "finding", replace the `delete` with `select component_id`, run the query again, and see how long it takes and how many records it returns. – Christian Nov 24 '15 at 18:07
  • The table from which I need to delete the records doesn't have any foreign keys. A trigger 'on delete' is not an option, because after a record is deleted from the component_live table an insert comes in with new data; that insert could be different from what was deleted, or it could be the same. That is why the cleanup after all the operations is needed. I estimate between 0 and approx. 1000 records will be deleted; the range is so wide because of single inserts versus bulk inserts. – admix Nov 24 '15 at 18:09
  • @admix Can you add the `explain (analyze, verbose, buffers)` output for my query to your question? Maybe it also spills stuff to disk. – Nov 24 '15 at 18:17
  • @a_horse_with_no_name updated my question with more info. Thanks for helping out. I re-analyzed and re-vacuumed everything, so the execution time went down a little. I was thinking maybe to increase the memory in the postgres config. – admix Nov 24 '15 at 18:36
  • @ChristianB.Almeida the select itself takes 80%-90% of the time. In my case it now returns only 1 record (that's the test use case I'm working on). – admix Nov 24 '15 at 18:37
  • @admix 80 million records in each table... that is a lot of data. What is the server's RAM size? What is the value of the `effective_cache_size` setting in postgresql.conf? – Christian Nov 24 '15 at 19:16
  • The postgres server has 32GB of RAM. effective_cache_size = 20GB. – admix Nov 24 '15 at 19:19
  • @admix (I may be wrong, but) I think the `shared hit=20875` in the explain output is pretty low... AFAIK, this shows that Postgres is not finding records in memory regularly and has to go to disk instead. So, what is your shared_buffers parameter? By the way, I ran a test here with only 1 million records in each table (but my server has only 4GB of RAM), and it took just 1 sec to run. The difference was that I had `shared hit=1010548`, which is way better than yours. – Christian Nov 24 '15 at 19:59
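
The settings discussed in these comments can be checked from psql, and work_mem can be raised per session before running the cleanup (the 256MB value below is purely illustrative, not a recommendation from this thread):

show shared_buffers;
show effective_cache_size;
show work_mem;
set work_mem = '256MB';  -- session-only; hashes and sorts spill to temp files when it is too small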

The exists way:

delete 
from   component_apportionment ca
where  not exists 
       (select 1
        from component_live cl
        where cl.component_id = ca.component_id
       );

Or the in way (note: Postgres cannot plan `not in (subquery)` as an anti-join when the column is nullable, so this variant is often much slower on big tables):

delete 
from   component_apportionment
where  component_id not in 
       (select component_id
        from component_live
       );

Also, create indexes on the component_id columns of both tables, as sketched below.
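
A sketch of those index definitions (the index names are hypothetical; any names will do):

create index component_apportionment_component_id_idx
    on component_apportionment (component_id);
create index component_live_component_id_idx
    on component_live (component_id);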

UPDATE

I made a script for testing:

-- create and populate the tables (1,000,000 records each)
drop table if exists component_apportionment;
drop table if exists component_live;
create table component_live (component_id numeric primary key);
create table component_apportionment (id serial primary key, component_id numeric);
create index component_apportionment_idx on component_apportionment (component_id);
insert into component_live select g from generate_series(1,1000000) g;
insert into component_apportionment (component_id) select trunc(random()*1000000) from generate_series(1,1000000) g;
analyze verbose component_live;
analyze verbose component_apportionment;



EXPLAIN (analyze, verbose, buffers)
select component_id
from   component_apportionment ca
where  not exists 
       (select 1
        from component_live cl
        where cl.component_id = ca.component_id
       );


Merge Anti Join  (cost=0.85..61185.85 rows=1 width=6) (actual time=0.013..1060.014 rows=2 loops=1)
  Output: ca.component_id
  Merge Cond: (ca.component_id = cl.component_id)
  Buffers: shared hit=1010548
  ->  Index Only Scan using component_apportionment_idx on admin.component_apportionment ca  (cost=0.42..24015.42 rows=1000000 width=6) (actual time=0.006..460.318 rows=1000000 loops=1)
        Output: ca.component_id
        Heap Fetches: 1000000
        Buffers: shared hit=1003388
  ->  Index Only Scan using component_live_pkey on admin.component_live cl  (cost=0.42..22170.42 rows=1000000 width=6) (actual time=0.005..172.502 rows=999998 loops=1)
        Output: cl.component_id
        Heap Fetches: 999998
        Buffers: shared hit=7160
Total runtime: 1060.035 ms
Christian
  • The 'not in' query was taking forever, so I killed it. The 'not exists' one is similar to @a_horse_with_no_name's answer. Thanks for the input. – admix Nov 24 '15 at 17:44
  • Yes, it's identical... He's a faster typist than me. – Christian Nov 24 '15 at 17:56