1

I have a table in SQL Server 2000 with data similar to the following:

ReferenceNumber    ReferenceValue
00001              Not assigned
00002              Not assigned
00002              ABCDE

in which each ReferenceNumber can appear multiple times in the table, either with a ReferenceValue of 'Not assigned' or a true ReferenceValue.

I want to dump the data into a cleaned-up table with only one row per ReferenceNumber and a true ReferenceValue if it exists, or 'Not assigned' if there are no true ReferenceValues.

I can see how to do it with two queries:

SELECT TOP 1 ReferenceNumber, ReferenceValue
INTO clean
FROM duplicates
WHERE ReferenceValue <> 'Not assigned'

INSERT INTO clean(ReferenceNumber, ReferenceValue)
SELECT TOP 1 ReferenceNumber, ReferenceValue
WHERE ReferenceValue = 'Not assigned' 
AND ReferenceNumber NOT IN (SELECT ReferenceNumber FROM clean)

but I'm thinking there must be a better way. Any ideas?

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
taserian
  • 624
  • 6
  • 17

2 Answers2

2

For SQL SERVER 2000, this is probably easiest. First clause = "real" values, second clause where not found in first clause. And an extension of your idea.

SELECT d2.ReferenceNumber, d2.ReferenceValue
FROM duplicates d2
WHERE d2.ReferenceValue <> 'Not assigned'
UNION ALL
SELECT d1.ReferenceNumber, d1.ReferenceValue
FROM duplicates d1
WHERE NOT EXISTS (SELECT *
         FROM duplicates d2
         WHERE d2.ReferenceNumber = d1.ReferenceNumber AND
                 d2.ReferenceValue <> 'Not assigned')

However, what criteria do you want to tie break between "true" reference values? or just pick one?

gbn
  • 422,506
  • 82
  • 585
  • 676
  • For this specific problem, a ReferenceNumber will either have 'Not assigned' or have a unique ReferenceValue, so there's no issue with ties. – taserian Oct 22 '09 at 19:39
  • @taserian: removed my aggregate – gbn Oct 22 '09 at 19:42
  • I kept the aggregate, since even though a RefNumber can have one true RefValue, that RefNumber-RefValue pair can appear multiple times in the table. – taserian Oct 22 '09 at 19:57
  • This is three reads and a self-join, where you only need one read and no joins. – Peter Radocchia Oct 22 '09 at 20:16
  • @gbn: did you have parallel execution in mind when you suggested this method? – Peter Radocchia Oct 25 '09 at 02:32
  • @Peter: no. Just for clarity. I understand your solution (I upvoted) but my solution is hopefully more clear and obvious. – gbn Oct 25 '09 at 08:55
  • @gbn: do you think this would this use parallel reads? I think it might for the reads and joins, but I have a blind spot for parallelism and don't really know how to gauge the relative cost vs. an intermediate spool. Remus pointed out the possibility here: http://stackoverflow.com/questions/1618560/select-and-merge-rows-in-a-table-in-sql-stored-procedure/1618656#1618656. I was certain before, not so certain now. – Peter Radocchia Oct 25 '09 at 13:20
  • @Peter: heck. I don't know. However, spooling can be expensive and often one of the hardest things to tune away. I guess "it depends": indexes, number of rows, does cost meet parallel threshold etc. – gbn Oct 25 '09 at 14:23
  • @gbn: I appreciate the discussion. Thanks! – Peter Radocchia Oct 25 '09 at 14:50
2

Something like this:

SELECT 
  ReferenceNumber
, ReferenceValue = ISNULL(MAX(NULLIF(ReferenceValue,'Not assigned')),'Not assigned')
INTO Table1_Clean
FROM Table1
GROUP BY
  ReferenceNumber

MAX() ignores NULLs, so convert whatever you don't want to NULL first, then MAX(), then convert NULLs back to a dummy value.

One pass, in-line, can't get much more efficient.

Peter Radocchia
  • 10,710
  • 2
  • 34
  • 56