I have a very large table with more than 10 million records. I want to find duplicates based on some fields matching and some fields not matching in it.
The query currently I am using is below:
SELECT DISTINCT MainTable.[lineitemid]
FROM [dbo].[lineitem] MainTable
INNER JOIN [dbo].[lineitem] AS ChildTable
ON ChildTable.invoicedate = MainTable.invoicedate
AND LEFT(ChildTable.vendorname, 4) = LEFT(MainTable.vendorname, 4)
AND ChildTable.invoiceid <> MainTable.invoiceid AND -- Invoice ID column not matching
ChildTable.documentcurrencyamount = MainTable.documentcurrencyamount
WHERE ChildTable.lineitemid <> MainTable.lineitemid AND -- LineItemId is PK
MainTable.projectid = 1125 AND ChildTable.projectid = 1125 -- Duplicates should be identified with specific ProjectId
This query is working fine if the number of records for ProjectId is under 100,000. When the ProjectId records are more than 1 million, while executing this query, the tempdb size shoots up to 100 GB and causing low disk space issues. The query is taking forever to execute.
Please help me in optimizing the query.
Added the below lines after getting answer for the above query....
Thanks a lot, @Gordon-Linoff. The query you suggested worked much faster. The VendorName is from a different table. Can I include a inner join as shown below?
SELECT li1.[LineItemId]
FROM [dbo].[LineItem] li1
INNER JOIN VendorMaster vm1 ON li1.VendorNumber=vm1.VendorNumber
AND vm1.CompanyCode = li1.CompanyCode
WHERE EXISTS (SELECT 1
FROM [dbo].[LineItem] as li2
INNER JOIN VendorMaster vm2 on li2.VendorNumber = vm2.VendorNumber
AND vm2.CompanyCode = li2.CompanyCode
WHERE li2.InvoiceDate = li.InvoiceDate and
LEFT(li2.VendorName, 4) = LEFT(li1.VendorName, 4) and
li2.InvoiceId <> li1.InvoiceId and -- Invoice ID column not matching
li2.DocumentCurrencyAmount = li1.DocumentCurrencyAmount and
li2.LineItemId <> li1.LineItemId and
li2.ProjectId = li1.ProjectId
li2.VendorNumber = li.VendorNumber)
AND li.ProjectId = 1125
Is it an efficient approach?