SQL Server : Find and Group Duplicates by "weak" criterias

Question

I try to list and group some potential duplicates from a Person table.

The schema looks like this:

Id    LastName      OriginalName    FirstName
---------------------------------------------
1     Nolte         Huber           Silvia
2     Nolte                         Johann
3     Huber                         Milan
4     Huber                         Silvia
5     Abacherli                     Adrian
6     Abächerli                     Adrian    
7     Meier                         Hans
8     Meier                         Urs
9     Meyer                         Hans
10    Meier                         Urs
11    Hermann                       Marco
12    Huber                         Milan
13    Meyer                         Hans

Expected result:

GroupNumber   Id    LastName      OriginalName    FirstName 
-----------------------------------------------------------
1             5     Abacherli                     Adrian
1             6     Abächerli                     Adrian  
2             3     Huber                         Milan
2             12    Huber                         Milan
3             4     Huber                         Silvia
3             1     Nolte         Huber           Silvia
4             7     Meier                         Hans
4             9     Meyer                         Hans
4             13    Meyer                         Hans
5             8     Meier                         Urs
5             10    Meier                         Urs

Explanation:

I want to group rows that are near matches and list them in a grid of a web-application (ASP.NET MVC). A considered duplicate must have at least:

same LastName and same FirstName OR
LastName like OrginalName and same FirstName

To make things more complicated, "same" means a phonetic match (i.e via SOUNDEX or a similar function): Meyer == Meier == meier.

Technology in use:

Microsoft SQL Server 2012
Telerik DataAccess ORM
.NET Framework 4.5, C#

Expected answer:

A pure SQL query OR
A stored procedure OR
A combination of a SQL query / SP and LINQ query for the ORM in C#

All approaches that I have worked out so far, missing the GroupNumber. This is such a (non-working) query:

SELECT 
    Id, LastName, FirstName 
FROM 
    Person p1,
    (SELECT
     p1.Id AS Id1
     FROM Person p1
     INNER JOIN Person p2
     ON (p1.LastName LIKE p2.LastName OR p1.LastName LIKE p2.OriginalName)
     AND p1.FirstName LIKE p2.FirstName AND p1.Id <> p2.Id
     GROUP BY p1.Id
     HAVING COUNT(*) > 1) AS p2
WHERE 
    p1.Id IN (SELECT Id1)
ORDER BY
    p1.LastName, FirstName, Id

[Bad habits to kick : using old-style JOINs](https://sqlblog.org/2009/10/08/bad-habits-to-kick-using-old-style-joins) - that old-style *comma-separated list of tables* style should **no longer be used** and instead it is recommended to use the **proper ANSI JOIN** syntax introduced with the ANSI-**92** SQL Standard (more than **20 years** ago) — marc_s, Nov 06 '14 at 11:32
Have you thought of using Microsoft Data Quality Services (part of SQL Server) to identify duplicates and close matches? This can output to a table which can be displayed in a web form? — Steve Ford, Nov 06 '14 at 12:35
@SteveFord: I had a quick look at DQS. In terms of duplicate finding and grouping it will fulfill the requirements. But it seems that there is no API (see [link](http://stackoverflow.com/questions/15293671/use-of-dqs-apis)) - so I cannot integrate the results in the existing web-application. — flo, Nov 06 '14 at 12:54
@flo Agreed there is no API, but you can schedule DQS to run via an SSIS package with results into a table which can be displayed. Alternatively why not just use the DQS client to display the matches? — Steve Ford, Nov 06 '14 at 15:35
@SteveFord thanks a lot! I'm currently reviewing the answer of raphaël-althaus and your answer. On the first look your answer seems to be a little cleaner but I have to see how they perform on my dataset of 110'000 persons (and if the results differ). On DQS: It seems to be a little overkill, but probably I will have another look - the similiarity of matches seems to use a better algorithm compared to SOUNDEX. — flo, Nov 07 '14 at 16:11

Steve Ford · Accepted Answer · 2014-11-06T16:26:54.137

How about this:

SQL Fiddle

MS SQL Server 2012 Schema Setup:

CREATE TABLE Person
( ID Int,
  LastName Varchar(50),
  OriginalName Varchar(50),
  FirstName varchar(50)
)

INSERT INTO Person
VALUES
  (1, 'Nolte', 'Huber','Silvia'),
  (2,'Nolte', '', 'Johann'),
  (3,'Huber', '', 'Milan'),
  (4,'Huber', '', 'Silvia'),
  (5,'Abacherli', '', 'Adrian'),
  (6,'Abacherli', '', 'Adrian'),
  (7,'Meier', '', 'Hans'),
  (8,'Meier', '', 'Urs'),
  (9,'Meyer', '', 'Hans'),
  (10,'Meier', '', 'Urs'),
  (11,'Hermann', '', 'Marco'),
  (12,'Huber', '', 'Milan'),
  (13,'Meyer', '', 'Hans')

Query 1:

;WITH PersonCTE
AS
(
    SELECT ID, SOUNDEX(LastName) AS LastNameSDX, LastName, OriginalName, SOUNDEX(FirstName) FirstNameSDX, FirstName
    FROM Person
    UNION ALL
    SELECT ID, SOUNDEX(OriginalName) AS LastNameSDX, LastName, OriginalName, SOUNDEX(FirstName) FirstNameSDX, FirstName
    FROM Person
    WHERE OriginalName <> ''
),
PersonRankCTE
AS
(
    SELECT DENSE_RANK() OVER (ORDER BY LastNameSDX, FirstNameSdx) AS Grp, * 
    FROM PersonCTE
)
SELECT DENSE_RANK() OVER(ORDER BY grp) AS Grp, ID, LastName, OriginalName, FirstName
FROM PersonRankCTE P1
WHERE (SELECT COUNT(*) FROM PersonRankCTE P2 WHERE P1.grp = P2.grp) > 1

Results:

| GRP | ID |  LASTNAME | ORIGINALNAME | FIRSTNAME |
|-----|----|-----------|--------------|-----------|
|   1 |  5 | Abacherli |              |    Adrian |
|   1 |  6 | Abacherli |              |    Adrian |
|   2 |  3 |     Huber |              |     Milan |
|   2 | 12 |     Huber |              |     Milan |
|   3 |  1 |     Nolte |        Huber |    Silvia |
|   3 |  4 |     Huber |              |    Silvia |
|   4 | 13 |     Meyer |              |      Hans |
|   4 |  9 |     Meyer |              |      Hans |
|   4 |  7 |     Meier |              |      Hans |
|   5 |  8 |     Meier |              |       Urs |
|   5 | 10 |     Meier |              |       Urs |

score 0 · Answer 2 · answered Nov 06 '14 at 13:26

Maybe (probably ?) over complicated, but...

I make two CTE

1 to get all Person fields with corresponding Soundex LastName and OriginalName

1 to create the group and get a GroupNumber. Making an Union All to get, on 1 "column", the "soundexed" LastName and OriginalName (and take only the duplicates)

so

with cte as (select 
                   id, 
                   LastName, 
                   OriginalName, 
                   soundex(LastName) as sdxLastName, 
                   soundex(OriginalName) as sdxOriginalName, 
                   FirstName 
             from Person),

     grp as (select lname, FirstName, row_number() over(order by lname) rn  
             from (
                  select 
                    sdxOriginalName as lname, 
                    FirstName from cte
                  where sdxOriginalName is not null
                  union all 
                  select 
                      sdxLastName as lname, 
                      FirstName from cte) s
                group by lname, FirstName
              having count(*) > 1)
select 
    g.rn as GroupNumber,
    p.Id,
    p.LastName,
    p.OriginalName,
    p.FirstName
from grp g
join cte p on p.firstName = g.FirstName and 
    (sdxLastName = g.lname or sdxOriginalName = g.lname)
order by rn

see Sqlfiddle

SQL Server : Find and Group Duplicates by "weak" criterias

2 Answers2