I'm using SQL Server Integration Services in Visual Studio 2012 and I'm trying to find similar addresses that are referenced by different customers using the Fuzzy Grouping component. Here's some sample data (SQL Fiddle):
CREATE TABLE [tblAddresses] (
[AddressId] [int] IDENTITY(1,1) NOT NULL
,[AddressName] [nvarchar](20) NOT NULL
,[Street] [nvarchar](20) NULL
,[ZipCode] [nvarchar](10) NULL
,[Town] [nvarchar](20) NULL
,[CustomerId] [int] NOT NULL
)
INSERT INTO [tblAddresses] (
[AddressName]
,[Street]
,[ZipCode]
,[Town]
,[CustomerId]
) VALUES
(
'John Doe'
,'1447 Example Street'
,'28815'
,'Smallville'
,1
),
(
'Jane Doe'
,'1447 Example Str.'
,'28815'
,'Smallville'
,1
),
(
'Jim Doe'
,'1447 Example Str.'
,'28815'
,'Smallville'
,2
),
(
'Jack Domino'
,'815 Broadway Blvd.'
,'10000'
,'Capital City'
,2
),
(
'Joan Doe'
,'1447 Example Street'
,'28815'
,'Smallville'
,3
)
I have done this before and it's no problem comparing all the rows without further restrictions. Today, however, I only want to compare rows that have different values inside the CustomerId
column. So regarding my sample data I'm not interested in any similarities between line 1 and 2 and neither in any similarities between line 3 and 4. What I do want to find out is:
- What are the chances of line 3 being a duplicate (or "close relative") of line 1 or line 5?
- What are the chances of line 4 being a duplicate (or "close relative") of line 1 or line 5?
- What are the chances of line 5 being a duplicate (or "close relative") of line 1 or line 3?
Is there a way of integrating the "different CustomerId value" requirement within the Fuzzy Grouping component?
Thanks in advance.