0

Context: I have this table called product and I am writing a script to clean up some old products, but have to follow rules, which can be deleted. One of the rules, is to only delete it if its not referenced by some other specific tables.

So currently I find all the products (using other simler rules) and then try to filter that IDs list on all referenced products like:

SELECT id FROM product
WHERE 
  id NOT IN (SELECT product_id FROM sale_line)
  AND id NOT IN (SELECT product_id FROM purchase_line)

On smaller number of products, it works. But in real scenario, there is over a million rows. And this query just takes forever.

Is there any more efficient way to determine if table row is referenced on some other specific tables, before trying to delete it?

Andrius
  • 19,658
  • 37
  • 143
  • 243
  • Actually, that script completed, but it took 10 minutes:) – Andrius Mar 17 '23 at 13:15
  • Try NOT EXISTS instead of NOT IN –  Mar 17 '23 at 13:20
  • @a_horse_with_no_name maybe I am using it incorrectly? But if I replace `NOT IN` with `NOT EXISTS`, it just does not find any result at all then (I also have to remove `id` in WHERE as it gives syntax error. – Andrius Mar 17 '23 at 13:34

2 Answers2

1

Typically NOT EXISTS is faster than NOT IN:

SELECT p.id 
FROM product p
WHERE NOT EXISTS (select * from sale_line sl where sl.product_id = p.id)
  AND NOT EXISTS (select * from purchase_line pl where pl.product_id = p.id)
0

I have my idea.

  1. Gathering all product_id from both tb_sale_line and tb_purchase_line by distinct and save to tb_all_product_id (This table has all product id is pk, and row_num as key (auto increment))

  2. Write console/script app for fetch product id every 100,1000, or 1M rows by specified range of row_num and search in old table by exists condition.

Sum Settavut
  • 1
  • 1
  • 1