How to find duplicates in a large table based on matching and non matching fields?

Question

I have a very large table with more than 10 million records. I want to find duplicates based on some fields matching and some fields not matching in it.

The query currently I am using is below:

SELECT DISTINCT MainTable.[lineitemid] 
FROM   [dbo].[lineitem] MainTable 
       INNER JOIN [dbo].[lineitem] AS ChildTable 
               ON ChildTable.invoicedate = MainTable.invoicedate 
                  AND LEFT(ChildTable.vendorname, 4) = LEFT(MainTable.vendorname, 4) 
                  AND ChildTable.invoiceid <> MainTable.invoiceid AND -- Invoice ID column not matching 
                  ChildTable.documentcurrencyamount = MainTable.documentcurrencyamount 
WHERE  ChildTable.lineitemid <> MainTable.lineitemid AND -- LineItemId is PK 
       MainTable.projectid = 1125 AND ChildTable.projectid = 1125 -- Duplicates should be identified with specific ProjectId

This query is working fine if the number of records for ProjectId is under 100,000. When the ProjectId records are more than 1 million, while executing this query, the tempdb size shoots up to 100 GB and causing low disk space issues. The query is taking forever to execute.

Please help me in optimizing the query.

Added the below lines after getting answer for the above query....

Thanks a lot, @Gordon-Linoff. The query you suggested worked much faster. The VendorName is from a different table. Can I include a inner join as shown below?

SELECT li1.[LineItemId]
FROM [dbo].[LineItem] li1
INNER JOIN VendorMaster vm1 ON li1.VendorNumber=vm1.VendorNumber
AND vm1.CompanyCode = li1.CompanyCode
WHERE EXISTS (SELECT 1
              FROM [dbo].[LineItem] as li2  
            INNER JOIN VendorMaster vm2 on li2.VendorNumber = vm2.VendorNumber
              AND vm2.CompanyCode = li2.CompanyCode
              WHERE li2.InvoiceDate = li.InvoiceDate  and 
                    LEFT(li2.VendorName, 4) = LEFT(li1.VendorName, 4) and 
                    li2.InvoiceId <> li1.InvoiceId and  -- Invoice ID column not matching
                    li2.DocumentCurrencyAmount = li1.DocumentCurrencyAmount and
                    li2.LineItemId <> li1.LineItemId and
                    li2.ProjectId = li1.ProjectId
            li2.VendorNumber = li.VendorNumber)
AND li.ProjectId = 1125

Is it an efficient approach?

Have you looked at the execution plan? What indexes do you have on the table? — Donal, Oct 23 '14 at 03:32
@Donal, I have created an Index on the key fields that I am using for matching and non matching - 'CREATE NONCLUSTERED INDEX [idx_KeyFields] ON [dbo].[LineItem] ([ProjectId],[VendorNumber]) INCLUDE ([LineItemId],[InvoiceId],[InvoiceDate],[DocumentCurrencyAmount])' — Manigandan Dorairaj, Oct 23 '14 at 03:38
I would accept Gordon's answer - if I were you - he knows his stuff. — Donal, Oct 23 '14 at 03:39
@iMan . . . That should be a reasonable approach for vendor name. — Gordon Linoff, Oct 24 '14 at 04:20
@GordonLinoff, thanks. I have added inner join in the Select and Exists to join with Vendor master table on vendor number and company code. Could you please see the modified query (the 2nd one) in the question. Kindly please suggest me any efficient way to improve its performance? — Manigandan Dorairaj, Oct 28 '14 at 03:15

score 2 · Accepted Answer · answered Oct 23 '14 at 03:32

A less expensive way to run this query is to use exists and dispense with the distinct:

SELECT li.[LineItemId]
FROM [dbo].[LineItem] li
WHERE EXISTS (SELECT 1
              FROM [dbo].[LineItem] as li2 on 
              WHERE li2.InvoiceDate = li.InvoiceDate and 
                    LEFT(li2.VendorName, 4) = LEFT(li.VendorName, 4) and 
                    li2.InvoiceId <> li.InvoiceId and  -- Invoice ID column not matching
                    li2.DocumentCurrencyAmount = li.DocumentCurrencyAmount and
                    li2.LineItemId <> li.LineItemId and
                    li2.ProjectId = li.ProjectId
WHERE MainTable.ProjectId = 1125;

For performance, an index on LineItem(ProjectId, InvoiceDate, DocumentCurrencyAmount, VendorName, InvoiceId, LineItemId) would help. You could further speed the query by declaring LEFT(LineItem.VendorName, 4) as a computed column and adding it to the index before VendorName.

thanks, I will try adding computed column and modif y the index to include the columns you have suggested. — Manigandan Dorairaj, Oct 23 '14 at 03:42
Thanks a lot, @Gordon-Linoff. This query worked much faster. I have added a join with another table, could you look at my post and kindly suggest? — Manigandan Dorairaj, Oct 23 '14 at 13:03

How to find duplicates in a large table based on matching and non matching fields?

1 Answers1