SQL: Detect duplicate customers

Question

im trying to create a sql query, that will detect (possible) duplicate customers in my database:

I have two tables:

Customer with the columns: cid, firstname, lastname, zip. Note that cid is the unique customer id and primary key for this table.
IgnoreForDuplicateCustomer with the columns: cid1, cid2. Both columns are foreign keys, which references to Customer(cid). This table is used to say, that the customer with cid1 is not the same as the customer with the cid2.

So for example, if i have

a Customer entry with cid = 1, firstname="foo", lastname="anonymous" and zip="11231"
and another Customer entry with cid=2, firstname="foo", lastname="anonymous" and zip="11231".

So my sql query should search for customers, that have the same firstname, lastname and zip and the detect that customer with cid = 1 is the same as customer with cid = 2.

However, it should be possible to say, that customer cid = 1 and cid=2 are not the same, by storing a new entry in the IgnoreForDuplicateCustomer table by setting cid1 = 1 and cid2 = 2.

So detecting the duplicate customers work well with this sql query script:

SELECT cid, firstname, lastname, zip, COUNT(*) AS NumOccurrences
       FROM Customer
 GROUP BY fistname, lastname,zip
       HAVING ( COUNT(*) > 1 )

My problem is, that i am not able, to integrate the IgnoreForDuplicateCustomer table, to that like in my previous example the customer with cid = 1 and cid=2 will not be marked / queried as the same, since there is an entry/rule in the IgnoreForDuplicateCustomer table.

So i tried to extend my previous query by adding a where clause:

    SELECT cid, firstname, lastname, COUNT(*) AS NumOccurrences
               FROM Customer    
    WHERE cid NOT IN (
                     SELECT cid1 FROM IgnoreForDuplicateCustomer WHERE cid2=cid 
                     UNION 
                     SELECT cid2 FROM IgnoreForDuplicateCustomer WHERE cid1=cid
                     )  
     GROUP BY firstname, lastname, zip
     HAVING ( COUNT(*) > 1 )

Unfortunately this additional WHERE clause has absolutely no impact on my result. Any suggestions?

vulkanino · Answer 1 · 2012-02-21T14:23:05.840

1

Edit as per TPete's comment (dind't try it):

SELECT 
    C1.cid, C1.firstname, C1.lastname
FROM 
    Customer C1,
    Customer C2
WHERE
    C1.cid < C2.cid AND 
    C1.firstname = C2.firstname AND 
    C1.lastname = C2.lastname AND 
    C1.zip = C2.zip AND 
    CAST(C1.cid AS VARCHAR)+' ' +CAST(C2.cid AS VARCHAR) <> 
       (SELECT CAST(cid1 AS VARCHAR)+' '+CAST(cid2 AS VARCHAR) FROM IgnoreForDuplicateCustomer I WHERE I.cid1 = C1.cid AND I.cid2 = C2.cid);

Initially I thought that IgnoreForDuplicateCustomer was a field in the customer table.

edited Feb 21 '12 at 14:23

answered Feb 21 '12 at 13:57

vulkanino

9,074
7
44
71

`IgnoreForDuplicateCustomer` is the name of another table, so this will not work. – TPete Feb 21 '12 at 14:09
Hi, i am using mysql and get the error that cid, firstname, lastname is ambiguous and i don't understand why i should use this condition C1.cid < c2.cid, since cid is a auto increment generated id. And what if there is a third customer with cid=99 and firstname, lastname and zip like the other customers (cid=1, cid=2) – sockeqwe Feb 21 '12 at 14:13
The column names need to be qualified, user either `C1` or `C2`. The condition should be `C1.cid` != `C2.cid`, although the result would be the same in this case. – TPete Feb 21 '12 at 14:17
edited to qualify the field names, thanks @TPete. I think != would select two records (less than and greater than both qualify), so my `C1.cid < C2.cid` shuold be right. – vulkanino Feb 21 '12 at 14:24
Thank you, i dont know why, but i get an Syntax error near: VARCHAR)+' ' +CAST(C2.cid AS VARCHAR) <> (SELECT CAST(cid1 AS VARCHAR)+ – sockeqwe Feb 21 '12 at 14:55
mysql doesn't allow to CAST to varchar, maybe you could CAST to CHAR but then you'll have to specify a size parameter. Maybe it would be better to change the query to use a LEFT JOIN. – vulkanino Feb 21 '12 at 15:08
You shouldn't need to cast at all - that should be modifiable to an `EXISTS` clause, regardless of anything else. But yes, an `EXCEPTION JOIN` or `LEFT JOIN ... WHERE NULL` should also get rid of the duplicates (although you may need to make the relationship bi-directional). And ditch the implicit-join syntax, it just makes things slightly hairier. – Clockwork-Muse Feb 21 '12 at 17:03

score 1 · Accepted Answer · answered Feb 21 '12 at 19:35

Here you are:

Select a.*
From (
  select c1.cid 'CID1', c2.cid 'CID2'
  from Customer c1 
  join Customer c2 on c1.firstname=c2.firstname 
    and c1.lastname=c2.lastname and c1.zip=c2.zip
    and c1.cid < c2.cid) a
Left Join (
  Select cid1 'CID1', cid2 'CID2'
  From ignoreforduplicatecustomer one
 Union
  Select cid2 'CID1', cid1 'CID2'
  From ignoreforduplicatecustomer two) b on a.cid1 = b.cid1 and a.cid2 = b.cid2
where b.cid1 is null

This will get you the IDs of duplicate records from customer table, which are not in table ignoreforduplicatecustomer.

Tested with:

CREATE TABLE IF NOT EXISTS `customer` (
 `CID` int(11) NOT NULL AUTO_INCREMENT,
 `Firstname` varchar(50) NOT NULL,
 `Lastname` varchar(50) NOT NULL,
 `ZIP` varchar(10) NOT NULL,
 PRIMARY KEY (`CID`)) 
ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=100 ;

INSERT INTO `customer` (`CID`, `Firstname`, `Lastname`, `ZIP`) VALUES
(1, 'John', 'Smith', '1234'),
(2, 'John', 'Smith', '1234'),
(3, 'John', 'Smith', '1234'),
(4, 'Jane', 'Doe', '1234');

And:

CREATE TABLE IF NOT EXISTS `ignoreforduplicatecustomer` (
 `CID1` int(11) NOT NULL,
 `CID2` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


INSERT INTO `ignoreforduplicatecustomer` (`CID1`, `CID2`) VALUES
(1, 2);

Results for my test setup are:

CID1  CID2
 1     3
 2     3

Hi, thank you very much! But, i dont understand why do I need this "and c1.cid < c2.cid" clause in the first join. But in fact it only works with this additional condition ... — sockeqwe, Feb 22 '12 at 21:46
Well, you want those IDs to be different, so you know it is not the same row. You could use `c1.cid != c2.cid`, but that would get you too many hits, because it will include both cases (`c1.cid < c2.cid` and `c1.cid > c2.cid`) which are essentially the same. In my example you would also get 3, 1 and 3, 2. By using `<` each duplicate is only included once. — TPete, Feb 23 '12 at 06:38

Diego · Answer 3 · 2012-02-21T15:08:43.540

0

crazy but I think it works :)

first I join the customer tables with itself on the names to get the duplicates then I exclud the keys on the IgnoreForDuplicateCustomer table (the union is because the first query returns cid1, cid2 and cid2,cid1

the result will be duplicated but I think you can get the info you need

select c1.cid, c2.cid
from Customer c1 
     join Customer c2 on c1.firstname=c2.firstname 
     and c1.lastname=c2.lastname and c1.zip=c2.zip
     and c1.cid!=c2.cid
except 
(
    select cid1,cid2 from IgnoreForDuplicateCustomer
    UNION
    select cid2,cid1 from IgnoreForDuplicateCustomer
)

second shot:

select firstname,lastname,zip from Customer 
group by firstname,lastname,zip 
having (count(*)>1)
except
select c1.firstname, c1.lastname, c1.zip
from Customer c1 join IgnoreForDuplicateCustomer IG on c1.cid=ig.cid1 join Customer c2 on ig.cid2=c2.cid

third:

select firstname,lastname,zip from (
    select firstname,lastname,zip from Customer 
    group by firstname,lastname,zip 
    having (count(*)>1)
) X
where firstname not in (
select c1.firstname
from Customer c1 join IgnoreForDuplicateCustomer IG on c1.cid=ig.cid1 join Customer c2 on ig.cid2=c2.cid
)

edited Feb 21 '12 at 15:08

answered Feb 21 '12 at 14:02

Diego

34,802
21
91
134

Unfortunately i use mysql and except is not supported by mysql... Maybe mysql NOT EXISTS could help, but im not sure, if youre sql code would work if there are more the two customers with the same firstname, lastname and zip ... – sockeqwe Feb 21 '12 at 14:22
Looks good, except for the `except`. You could replace that with a `Left Join` and only keeping not matching rows from the Union, i guess. – TPete Feb 21 '12 at 14:22
did not think about that. It would yes (I tested), but the result would be too messy :( – Diego Feb 21 '12 at 14:24
hi, I added a second query. try it with the NOT EXISTS – Diego Feb 21 '12 at 14:32
hi, i tried it with NOT EXISTS: select firstname,lastname,zip, COUNT(\*) AS nr from Customer WHERE NOT EXISTS ( select c1.firstname, c1.lastname, c1.zip from Customer c1 join IgnoreForDuplicateCustomer ig on c1.cid=ig.cid1 join Customer c2 on ig.cid2=c2.cid) group by firstname, lastname, zip having (count(*)>1) It works well with having no entry in IgnoreForDuplicateCustomer, but if i add a new IgnoreForDuplicateCustomer entry with INSERT INTO IgnoreForDuplicateCustomer VALUES (1,2) it will return alway a empty result set ... – sockeqwe Feb 21 '12 at 14:52
yes I don't think NOT exist behaves like EXCEPT. I gave another shot, see my edit3. Try to see if it runs and then we can discuss – Diego Feb 21 '12 at 15:08

SQL: Detect duplicate customers

3 Answers3