5

I have a table that contains unfortuantely bad data and I'm trying to filter some out. I am sure that the LName, FName combonation is unique since the data set is small enough to verify.

LName, FName, Email
-----  -----  -----
Smith  Bob    bsmith@example.com
Smith  Bob    NULL
Doe    Jane   NULL
White  Don    dwhite@example.com

I would like to have the query results bring back the "duplicate" record that does not have a NULL email, yet still bring back a NULL Email when there is not a duplicate.

E.g.

Smith Bob   bsmith@example.com
Doe   Jane  NULL
White Don   dwhite@example.com

I think the solution is similar to Sql, remove duplicate rows by value, but I don't really understand if the asker's requirements are the same as mine.

Any suggestions?

Thanks

Community
  • 1
  • 1
jmmr
  • 498
  • 7
  • 17
  • Thanks for the solutions from Cybernate and Michael Goldshteyn. Went with FooLman b/c it was the first solution that did the job. Interesting to see the different methods for resolution. – jmmr Jan 03 '11 at 14:32

4 Answers4

8

You can use ROW_NUMBER() analytic function:

SELECT *
  FROM (
                SELECT a.*, ROW_NUMBER() OVER(PARTITION BY LName, FName ORDER BY Email DESC) rnk
                    FROM <YOUR_TABLE> a
                ) a
WHERE RNK = 1
Chandu
  • 81,493
  • 19
  • 133
  • 134
  • +1 This will probably be more efficient than a self join. This technique can be used to delete as well as per my answer. – Martin Smith Dec 30 '10 at 22:29
7

This drops the null rows if there are any non null values.

SELECT  lname
        , fname
        , MIN(email)
FROM    YourTable
GROUP BY
        lname
        , fname

Test script

DECLARE @Test TABLE (
  LName VARCHAR(32)
  , FName VARCHAR(32)
  , Email VARCHAR(32)
)

INSERT INTO @Test
  SELECT 'Smith', 'Bob', 'bsmith@example.com'
  UNION ALL SELECT 'Smith', 'Bob', 'NULL'
  UNION ALL SELECT 'Doe', 'Jane', 'NULL'
  UNION ALL SELECT 'White', 'Don', 'dwhite@example.com'

SELECT  lname
        , fname
        , MIN(Email)        
FROM    @Test
GROUP BY
        lname
        , fname
Lieven Keersmaekers
  • 57,207
  • 13
  • 112
  • 146
FooLman
  • 457
  • 4
  • 11
3

Here is a relatively simple query that uses standard SQL and does just this:

SELECT * FROM Person P
WHERE Email IS NOT NULL OR -- Take all people with non-null e-mails
      Email IS NULL AND    -- and all people with null e-mails, as long as
        NOT EXISTS         -- there is no duplicate record of the same person
          (SELECT *        -- with a non-null e-mail
           FROM Person P2 
           WHERE P2.LName=P.LName AND P2.FName=P.FName AND P2.Email IS NOT NULL)
Michael Goldshteyn
  • 71,784
  • 24
  • 131
  • 181
  • Are you implying that `row_number` isn't standard SQL? – Martin Smith Dec 30 '10 at 22:24
  • I am not implying anything - just offering a simple solution using nothing more than a standard SQL query involving a sub-select. But, if you want the answer to your question regarding ROW_NUMBER, no it's not standard SQL and neither is PARTITION BY. – Michael Goldshteyn Dec 31 '10 at 01:12
  • ANSI SQL 1999 [according to here](http://ss64.com/ora/syntax-analytic.html) Not that standard SQL was ever specified by the OP anyway. – Martin Smith Dec 31 '10 at 10:46
1

Since there are plenty of SQL solutions posted already, you may want to create a data fix to remove the bad data, then add the necessary constraints to prevent bad data from ever being inserted. Bad data in a database is a side effect of poor design.

Billworth Vandory
  • 5,003
  • 5
  • 29
  • 34
  • I agree and understand, but not much I can do about it in enterprise IT land. It's the reality of the data I have to work with. – jmmr Dec 31 '10 at 00:49
  • @jrm82, it is all the more important to fix such things in Enterprise applications! Just because it's apin doesn't mean you shouldn't do it. This will cause no end of problems forever if you don't fix. – HLGEM Dec 31 '10 at 16:02
  • HLGEM - I don't 'own' the data and I don't have the ability to work on whatever I want. – jmmr Jan 03 '11 at 14:16