I try to list and group some potential duplicates from a Person
table.
The schema looks like this:
Id LastName OriginalName FirstName
---------------------------------------------
1 Nolte Huber Silvia
2 Nolte Johann
3 Huber Milan
4 Huber Silvia
5 Abacherli Adrian
6 Abächerli Adrian
7 Meier Hans
8 Meier Urs
9 Meyer Hans
10 Meier Urs
11 Hermann Marco
12 Huber Milan
13 Meyer Hans
Expected result:
GroupNumber Id LastName OriginalName FirstName
-----------------------------------------------------------
1 5 Abacherli Adrian
1 6 Abächerli Adrian
2 3 Huber Milan
2 12 Huber Milan
3 4 Huber Silvia
3 1 Nolte Huber Silvia
4 7 Meier Hans
4 9 Meyer Hans
4 13 Meyer Hans
5 8 Meier Urs
5 10 Meier Urs
Explanation:
I want to group rows that are near matches and list them in a grid of a web-application (ASP.NET MVC). A considered duplicate must have at least:
- same
LastName
and sameFirstName
OR LastName
likeOrginalName
and sameFirstName
To make things more complicated, "same" means a phonetic match (i.e via SOUNDEX
or a similar function): Meyer
== Meier
== meier
.
Technology in use:
- Microsoft SQL Server 2012
- Telerik DataAccess ORM
- .NET Framework 4.5, C#
Expected answer:
- A pure SQL query OR
- A stored procedure OR
- A combination of a SQL query / SP and LINQ query for the ORM in C#
All approaches that I have worked out so far, missing the GroupNumber
. This is such a (non-working) query:
SELECT
Id, LastName, FirstName
FROM
Person p1,
(SELECT
p1.Id AS Id1
FROM Person p1
INNER JOIN Person p2
ON (p1.LastName LIKE p2.LastName OR p1.LastName LIKE p2.OriginalName)
AND p1.FirstName LIKE p2.FirstName AND p1.Id <> p2.Id
GROUP BY p1.Id
HAVING COUNT(*) > 1) AS p2
WHERE
p1.Id IN (SELECT Id1)
ORDER BY
p1.LastName, FirstName, Id