SQL Remove almost duplicate rows

Question

I have a table that contains unfortuantely bad data and I'm trying to filter some out. I am sure that the LName, FName combonation is unique since the data set is small enough to verify.

LName, FName, Email
-----  -----  -----
Smith  Bob    bsmith@example.com
Smith  Bob    NULL
Doe    Jane   NULL
White  Don    dwhite@example.com

I would like to have the query results bring back the "duplicate" record that does not have a NULL email, yet still bring back a NULL Email when there is not a duplicate.

E.g.

Smith Bob   bsmith@example.com
Doe   Jane  NULL
White Don   dwhite@example.com

I think the solution is similar to Sql, remove duplicate rows by value, but I don't really understand if the asker's requirements are the same as mine.

Any suggestions?

Thanks

Thanks for the solutions from Cybernate and Michael Goldshteyn. Went with FooLman b/c it was the first solution that did the job. Interesting to see the different methods for resolution. — jmmr, Jan 03 '11 at 14:32

score 8 · Answer 1 · answered Dec 30 '10 at 22:09

8

You can use ROW_NUMBER() analytic function:

SELECT *
  FROM (
                SELECT a.*, ROW_NUMBER() OVER(PARTITION BY LName, FName ORDER BY Email DESC) rnk
                    FROM <YOUR_TABLE> a
                ) a
WHERE RNK = 1

answered Dec 30 '10 at 22:09

Chandu

81,493
19
133
134

+1 This will probably be more efficient than a self join. This technique can be used to delete as well as per my answer. – Martin Smith Dec 30 '10 at 22:29

score 7 · Accepted Answer · edited Dec 31 '10 at 10:39

7

This drops the null rows if there are any non null values.

SELECT  lname
        , fname
        , MIN(email)
FROM    YourTable
GROUP BY
        lname
        , fname

Test script

DECLARE @Test TABLE (
  LName VARCHAR(32)
  , FName VARCHAR(32)
  , Email VARCHAR(32)
)

INSERT INTO @Test
  SELECT 'Smith', 'Bob', 'bsmith@example.com'
  UNION ALL SELECT 'Smith', 'Bob', 'NULL'
  UNION ALL SELECT 'Doe', 'Jane', 'NULL'
  UNION ALL SELECT 'White', 'Don', 'dwhite@example.com'

SELECT  lname
        , fname
        , MIN(Email)        
FROM    @Test
GROUP BY
        lname
        , fname

edited Dec 31 '10 at 10:39

Lieven Keersmaekers

57,207
13
112
146

answered Dec 30 '10 at 22:05

FooLman

457
4
11

1

+1 This is by far the *easiest* solution posted that does what is asked. The community seems to suffer from overengeneering . – Lieven Keersmaekers Dec 31 '10 at 10:23
@Lieven - It won't work though! Where is there any logic to keep the null rows if there aren't any non null ones? – Martin Smith Dec 31 '10 at 10:34
@Martin, the query produces the output requested by op for given inputs doesn't it (or have I goofed up again)? – Lieven Keersmaekers Dec 31 '10 at 10:36
@Lieven - Ah I see this is giving the records to keep not the ones to delete. Sorry! as you were! – Martin Smith Dec 31 '10 at 10:38
@Martin, I don't think OP wants to delete any records, he merely wants a select statement. – Lieven Keersmaekers Dec 31 '10 at 10:40
Probably actually. I interpreted the title as meaning Remove from table. I guess they may just mean remove from select results though! – Martin Smith Dec 31 '10 at 10:41
Sorry for the confusion from the title, yes, I want to 'remove' the null rows from the select results. – jmmr Jan 03 '11 at 14:17

Michael Goldshteyn · Answer 3 · 2010-12-30T22:17:36.523

3

Here is a relatively simple query that uses standard SQL and does just this:

SELECT * FROM Person P
WHERE Email IS NOT NULL OR -- Take all people with non-null e-mails
      Email IS NULL AND    -- and all people with null e-mails, as long as
        NOT EXISTS         -- there is no duplicate record of the same person
          (SELECT *        -- with a non-null e-mail
           FROM Person P2 
           WHERE P2.LName=P.LName AND P2.FName=P.FName AND P2.Email IS NOT NULL)

edited Dec 30 '10 at 22:17

answered Dec 30 '10 at 22:06

Michael Goldshteyn

71,784
24
131
181

Are you implying that `row_number` isn't standard SQL? – Martin Smith Dec 30 '10 at 22:24
I am not implying anything - just offering a simple solution using nothing more than a standard SQL query involving a sub-select. But, if you want the answer to your question regarding ROW_NUMBER, no it's not standard SQL and neither is PARTITION BY. – Michael Goldshteyn Dec 31 '10 at 01:12
ANSI SQL 1999 [according to here](http://ss64.com/ora/syntax-analytic.html) Not that standard SQL was ever specified by the OP anyway. – Martin Smith Dec 31 '10 at 10:46

score 1 · Answer 4 · answered Dec 30 '10 at 22:26

1

Since there are plenty of SQL solutions posted already, you may want to create a data fix to remove the bad data, then add the necessary constraints to prevent bad data from ever being inserted. Bad data in a database is a side effect of poor design.

answered Dec 30 '10 at 22:26

Billworth Vandory

5,003
5
29
34

I agree and understand, but not much I can do about it in enterprise IT land. It's the reality of the data I have to work with. – jmmr Dec 31 '10 at 00:49
@jrm82, it is all the more important to fix such things in Enterprise applications! Just because it's apin doesn't mean you shouldn't do it. This will cause no end of problems forever if you don't fix. – HLGEM Dec 31 '10 at 16:02
HLGEM - I don't 'own' the data and I don't have the ability to work on whatever I want. – jmmr Jan 03 '11 at 14:16

SQL Remove almost duplicate rows

4 Answers4

Linked