1

I'm facing a problem where I have to select rows from a table ('CHILD') that should have a foreign key towards another table ('PARENT'). The problem is that the foreign key is broken (long story short, the table is partitioned and for some reason there are orphans) and I must clean the child table before restoring the foreign key. What I am trying to do is (roughly) :

SELECT child.ID 
from CHILD child 
WHERE child.PARENT_ID NOT IN 
(
  SELECT parent.ID FROM PARENT parent 
);

This seems correct (judging by the results), however it is quite inefficient: there are 1M results, and the child table holds 100M+ rows. Since I have to delete each row coming from the results of that query I am using pagination, but afaict this means that the NOT IN query is repeated each time. For this reason I am wondering if there is any way to improve the performance of the query. I have tried joining the tables but I have realized it won't work because I should join on child.PARENT_ID = parent.ID, so there would be no result. So the question is : is there any way to rewrite the NOT IN query so that it is more performant?

francesco foresti
  • 2,004
  • 20
  • 24
  • Are the parent and child keys `not null`? The logic of `not in` with null or potentially null values can force the optimiser into an inefficient row-by-row approach, and adding `is not null` conditions to both sides of the join can help. This is a bit of a guess though - table details and an execution plan would help narrow down the issue. – William Robertson Dec 28 '17 at 08:41
  • yes, both fields have a not null constraint (parent field is primary key, child field is foreign key to parent). – francesco foresti Dec 28 '17 at 08:48

3 Answers3

2

The not in query is not repeated each time. The query optimizer will likely do all sorts of optimizations. Then again, for a query like this, it may decide to scan each row, regardless of indexes, purely based on the balance between the number of rows between the two tables.

You could alternatively write it as a LEFT JOIN as demonstrated below, but it's arguably harder to read, because it doesn't communicate the intention very well (although that's not bad per se since this is a one time job only). And chances are that the optimizer just treats this as the same query.

SELECT child.ID 
FROM CHILD child 
LEFT JOIN PARENT parent ON parent.ID = child.PARENT_ID
WHERE parent.ID IS NULL

And the third syntax would be to use NOT EXISTS.

SELECT child.ID
FROM CHILD child
WHERE NOT EXISTS (
    SELECT * 
    FROM PARENT parent 
    WHERE parent.ID = child.PARENT_ID
)
GolezTrol
  • 114,394
  • 18
  • 182
  • 210
  • Note, `EXISTS` behave different than `IN` in case of NULL values. In case `parent.ID` has any NULL values you will get rows by using `NOT EXISTS` but not by `NOT IN` – Wernfried Domscheit Dec 28 '17 at 08:36
  • @GolezTrol I have tried using the `LEFT JOIN` construct, but since I would be joining on parent's primary key the query wouldn't return any result (it's not that there are null values, the rows aren't just there). – francesco foresti Dec 28 '17 at 08:51
  • @WernfriedDomscheit Thanks for the addition. In this case it won't be an issue, since parent.ID is the PK of the table, but in general it's definitely something to take into account. – GolezTrol Dec 28 '17 at 08:52
  • @francescoforesti That's the idea of a left join, to still return the records from the first table if there is no match in the second table. The where clause makes sure that *only* those records without match are returned. That query, like the others, should give you the IDs of all childs that don't have a matching parent. – GolezTrol Dec 28 '17 at 08:53
  • @GolezTrol I know You're right, but I've tried both queries (not in and left join) and the left join does return 0 (zero) results. So I have to conclude that they're not equivalent in this case (though I don't know why) – francesco foresti Dec 28 '17 at 08:57
1

I do not think it is necessary consider null values in this case. If a foreign key constraint is applied to the child parent_id column then there should be an primary key or unique constraint on the parent corresponding column.

  • If a primary key constraint is applied then null values are not allowed.
  • When an unique constraint is used then rows with null values will not be treated as unique values referenced by the foreign key.
  • Also if child parent_id is null the foreign key check will be ignored. If we want a strong relation we should apply the NOT NULL constraint on child parent_id.

If child allows nulls in parent_id you can:

update child
   set parent_id = null 
 where parent_id in (select parent_id from child  
                      minus 
                     select id from parent);

In this case we expect that using minus there should be a small amount of orphan keys in the result set.

If parent_id has a not null constraint you use the same syntax to delete rows

delete child
 where parent_id in (select parent_id from child  
                      minus 
                     select id from parent);

It may be a good idea to backup the orphans before making any changes to the child table.

create table child_orphans as
select *
  from child 
  natural join (select parent_id from child
                 minus 
                select id from parent);

This is efficient because of a small result set from the minus operation and also because inner join is used instead of semi join (in). This is true if the optimizer does not interfere too much.

0

This syntax is usually significantly more efficient:

select id from child
where
    parent_id in
    (
        select parent_id from child
        minus
        select id from parent
    );

Now, when you come to delete the child rows, that will be very slow.

Instead, it will be quicker to build a new table of the child records that you want to keep and then rename or drop the old child table (take a backup first!) and rename the new child table to child.

If you then create primary and foreign key constraints you won't get any more broken links.

I'm also concerned that you mention pagination. I'm not sure what you mean, but it sounds rather manual, which is not feasible for a million rows.

Ron Ballard
  • 693
  • 6
  • 8