2

I have a set of rows that contain duplicate entries because the data originates from multiples sources. I also have a separate reference table that indicates the priority of those data sources.

Does anyone have good tips for the most effective t-SQL to deduplicate this list?

Basically I have:

SELECT a.*, b.priority 
FROM tableA as a
JOIN tableB as b
ON b.sourceId = a.sourceId

I have been placing this into a temp table and then deleting in an odd way that I suppose could be more efficient.

DELETE ta
FROM #tmp ta
JOIN #tmp tb
ON ta.duplicateId = tb.duplicateId
WHERE ta.priority < tb.priority

Table A has the same columns for the two sources, but the data can differ - so they may have different prices. The challenge is that I must take the price (and all other info) from the row that comes from the source with the highest priority. To complicate matters, I do not have data for every item from ALL sources.

So, item 1 may have data from source A and B, while item 2 may only have it from source B and C. Thus, the deletion needs to happen on a per-unique item basis.

Sam
  • 7,252
  • 16
  • 46
  • 65
jkelley
  • 2,570
  • 3
  • 21
  • 24
  • 1) When you say you have duplicate entries, are all columns in TableA the same for two rows that are duplicate? If they aren't and tableA.SourceId is different, and you have one and only one matching source Id in tableB, and you want to delete the duplicates, then just run your delete on tableA instead of the temp table. – Shannon Severance Jul 31 '09 at 03:15

2 Answers2

1

I think you could do something like this:

SELECT a.*, b.priority 
FROM tableA as a
JOIN tableB as b
ON b.sourceId = a.sourceId and b.priority = (select max(priority) from tableB where b.sourceId = a.sourceId)

I can't remember if tSql will have a in scope for the subquery or not though.

Jon
  • 2,085
  • 2
  • 20
  • 28
  • It does have scope at that point - Can you explain to me how that nested select is joining to properly get the right priority? – jkelley Jul 31 '09 at 18:04
  • 1
    Sure so this query will run for every row of tableA so the result will be the highest for the sourceId of the row it's being run on it. It's like running a function on every row that's returned. – Jon Jul 31 '09 at 18:17
0

Try using ROW_NUMBER to find the ones you want to keep, and then getting rid of the rest.

...and remember to end the previous statement with a semi-colon...

with t as (
SELECT a.*, row_number() over (partition by a.sourceid order by b.priority desc) as priorityorder
FROM tableA as a
JOIN tableB as b
ON b.sourceId = a.sourceId
)
--select * from t
delete t 
where priorityorder > 1;

Rob

Rob Farley
  • 15,625
  • 5
  • 44
  • 58