
I'm trying to optimize a production query that is taking a long time. The goal is to find duplicate records based on matching field values and then delete them. The current query self-joins the table via an INNER JOIN on t1.col1 = t2.col1, then uses a WHERE clause to compare the remaining values.

select * from table t1 
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...

What would be a better way to do this? Or is it all the same, given the right indexes? Maybe:

select * from table t1, table t2
where t1.col1 = t2.col1 and t1.col2 = t2.col2 ...

This table has 100m+ rows.

MS SQL Server 2008 Enterprise.

Update: here is the actual query (column/table names changed):

select distinct t2.id
    from table1 t1 with (nolock)
    inner join table1 t2 with (nolock) on  t1.ckid=t2.ckid
    left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where 
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and 
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1) and 
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)  

    and (
            (
                t3.uniqueoid = 1
            )
            or
            (
                t3.uniqueoid is null and 
                isnull(t1.col1,'') = isnull(t2.col1,'') and 
                isnull(t1.col2,'') = isnull(t2.col2,'') and
                isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and 
                isnull(t1.stid,-1) = isnull(t2.stid,-1) and
                isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
                isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
                isnull(t1.col3,-1) = isnull(t2.col3,-1) 
            )
    )
Dustin Davis

6 Answers


Why a self join? This is an aggregate question.

Hope you have an index on col1, col2, ...

--DELETE t
--FROM table t
--JOIN (
select
    MIN(KeyCol) AS RowToKeep,
    col1, col2
from
    table
GROUP BY
    col1, col2
HAVING
   COUNT(*) > 1
--) dupe ON dupe.col1 = t.col1 AND dupe.col2 = t.col2
--WHERE t.KeyCol <> dupe.RowToKeep

However, this will take some time. Have a look at bulk delete techniques.
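For example, a minimal sketch of a batched delete, so each transaction stays small (#dupes, KeyCol, and the batch size are placeholders to adapt; it assumes the keys of the rows to remove were collected into #dupes first):

WHILE 1 = 1
BEGIN
    -- Delete in chunks to keep the transaction log and lock footprint small.
    DELETE TOP (10000) t
    FROM table t
    INNER JOIN #dupes d ON d.KeyCol = t.KeyCol;

    IF @@ROWCOUNT = 0 BREAK; -- nothing left to delete
END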

gbn
  • I've updated my question with the query (column/table names changed). Will your suggestion still work? – Dustin Davis May 02 '11 at 16:11
  • @Titan278: should do. No need to GROUP BY with ISNULL, and you should use EXISTS for the table2/t3 check (avoids the DISTINCT) – gbn May 02 '11 at 16:19
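In sketch form, what that comment suggests for the question's query (the correlation columns are guesses from the original join; note that the original left join's t3.uniqueoid is null branch also matched table2 rows whose uniqueoid column is NULL, so verify the semantics before swapping this in):

select t2.id
from table1 t1 with (nolock)
inner join table1 t2 with (nolock) on t1.ckid = t2.ckid
where t2.id > @Max_id
  and t2.timestamp > t1.timestamp
  and t2.rid = 2
  -- ... the isnull(t1.x, ...) = isnull(t2.x, ...) comparisons as in the original ...
  and (
        exists (select * from table2 t3
                where t3.cid = t1.cid and t3.typeid = t1.typeid
                  and t3.uniqueoid = 1)
        or (
            not exists (select * from table2 t3
                        where t3.cid = t1.cid and t3.typeid = t1.typeid)
            -- ... plus the extra isnull comparisons from the second OR branch ...
        )
      )

With table2 probed only through exists/not exists, the self join can no longer be multiplied by table2 rows, so the distinct becomes unnecessary.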

You can use ROW_NUMBER() to find duplicate rows in one table.

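For example, a minimal sketch of that approach, assuming id is the primary key and col1/col2 stand in for whatever column list defines a duplicate:

;WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY id) AS rn
    FROM table1
)
DELETE FROM numbered
WHERE rn > 1; -- keeps the row with the lowest id in each duplicate group

Deleting through the CTE works here because it references a single table.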

Bruno Costa
  • oooh. Going to be nasty over this size of table. I'd use an aggregate – gbn May 02 '11 at 16:03
  • Yes, but if the rows are exactly the same, I don't think an aggregate would work, at least not in a simple way. – Bruno Costa May 02 '11 at 16:19
  • You assume no PK: you can have duplicates outside of the PK. The OP's update shows there is a PK but still duplicates. It doesn't change the fact that ROW_NUMBER will run like a dog. – gbn May 02 '11 at 16:22
  • Well, from the examples we can see that he joins on t1.col1 = t2.col1 and compares col2 = col2 (etc.) in the WHERE clause. So I assume that all fields in the rows are equal. If you want to manipulate the data in that situation, I think ROW_NUMBER is the best, even if it runs like a dog (which is true compared with an aggregate). – Bruno Costa May 02 '11 at 16:27
  • `select distinct t2.id...` shows an id column outside of the duplicate checks. Performance *does* make a difference because of the available window to make the data changes (with backups, considering rollback time on error etc). I have billion+ row tables and 24/6 operation: my last large DML clean up took around 13 hours: work out how much slack I had... – gbn May 02 '11 at 16:32
  • @gbn, I know performance is important. I was talking about the two previous examples. That example wasn't there when I answered this question :-). – Bruno Costa May 02 '11 at 16:33

The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.

And, by the way, this won't work as written. You need at least one condition that distinguishes the rows (such as t1.id <> t2.id), or every record will match itself.

You might want to try something more like:

select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1
Jay

For a table with 100m+ rows, using GROUP BY and a holding table will be the better-optimized route, even though it translates into four queries.

STEP 1: Create a holding key:

SELECT col1, col2, dupe_count = COUNT(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING COUNT(*) > 1

STEP 2: Push all the duplicate entries into holddups. This is required for Step 4.

SELECT DISTINCT t1.*
INTO holddups
FROM t1
INNER JOIN holdkey
    ON t1.col1 = holdkey.col1
    AND t1.col2 = holdkey.col2

STEP 3: Delete the duplicate rows from the original table.

DELETE t1
FROM t1
INNER JOIN holdkey
    ON t1.col1 = holdkey.col1
    AND t1.col2 = holdkey.col2

STEP 4: Put the unique rows back in the original table. For example:

INSERT INTO t1 SELECT * FROM holddups
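Because step 3 deletes the rows before step 4 reinserts the deduplicated copies, it is safest to run those two steps inside one transaction, roughly like this:

BEGIN TRAN;

    -- Step 3: remove every row that belongs to a duplicate group.
    DELETE t1
    FROM t1
    INNER JOIN holdkey
        ON t1.col1 = holdkey.col1
        AND t1.col2 = holdkey.col2;

    -- Step 4: put one copy of each group back.
    INSERT INTO t1 SELECT * FROM holddups;

COMMIT TRAN;

Note that the DISTINCT in step 2 only collapses rows that are identical in every column, so rows that differ outside col1/col2 will all be reinserted; and on a 100m+ row table the delete generates a lot of log, so batching it (as in the bulk delete sketch above) may still be needed.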
Pravin

In my experience, SQL Server performance is really bad with OR conditions. It is probably not the self join but the join with table3 that causes the bad performance, but without seeing the plan I can't be sure.

In this case, it might help to split your query into two: one with a WHERE condition for t3.uniqueoid = 1 and one with a WHERE condition for the other conditions on table3, and then use UNION ALL to append one to the other.
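A sketch of that split against the question's query (the shared self-join conditions are elided as comments; the first half can use an inner join because t3.uniqueoid = 1 implies a matching table2 row):

select distinct t2.id
from table1 t1
inner join table1 t2 on t1.ckid = t2.ckid
inner join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
where t3.uniqueoid = 1
  -- ... plus the shared id/timestamp/isnull conditions from the original ...

union all

select distinct t2.id
from table1 t1
inner join table1 t2 on t1.ckid = t2.ckid
left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
where t3.uniqueoid is null
  -- ... plus the shared conditions and the second branch's extra isnull comparisons ...

If the same id can qualify under both halves, replace UNION ALL with UNION to keep the original DISTINCT semantics.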

FrankPl

To detect duplicates, you don't need to join:

SELECT col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1

That should be much faster.

Christoph Walesch
  • on a 100 million row table? I don't think so! – Mitch Wheat May 02 '11 at 15:54
  • @Mitch Wheat: any technique will run like a dog over 100m+ rows... especially ROW_NUMBER based functions – gbn May 02 '11 at 15:58
  • 2
    @Mitch: The practical question is not, "Is there a way to write this that will run in undex x second?" but "What is the fastest query to give the desired result." – Jay May 02 '11 at 16:21
  • @gbn: I disagree: some techniques will be Chihuahuas and some will be Rottweilers. And where do I mention ROW_NUMBER()? – Mitch Wheat May 02 '11 at 23:44