SSIS: Fuzzy Grouping only for specific rows

Question

I'm using SQL Server Integration Services in Visual Studio 2012 and I'm trying to find similar addresses that are referenced by different customers using the Fuzzy Grouping component. Here's some sample data (SQL Fiddle):

    CREATE TABLE [tblAddresses] (
      [AddressId] [int] IDENTITY(1,1) NOT NULL
      ,[AddressName] [nvarchar](20) NOT NULL
      ,[Street] [nvarchar](20) NULL
      ,[ZipCode] [nvarchar](10) NULL
      ,[Town] [nvarchar](20) NULL
      ,[CustomerId] [int] NOT NULL
    )

    INSERT INTO [tblAddresses] (
      [AddressName]
      ,[Street]
      ,[ZipCode]
      ,[Town]
      ,[CustomerId]
    ) VALUES 
    (
      'John Doe'
      ,'1447 Example Street'
      ,'28815'
      ,'Smallville'
      ,1
    ), 
    (
      'Jane Doe'
      ,'1447 Example Str.'
      ,'28815'
      ,'Smallville'
      ,1
    ), 
    (
      'Jim Doe'
      ,'1447 Example Str.'
      ,'28815'
      ,'Smallville'
      ,2
    ), 
    (
      'Jack Domino'
      ,'815 Broadway Blvd.'
      ,'10000'
      ,'Capital City'
      ,2
    ), 
    (
      'Joan Doe'
      ,'1447 Example Street'
      ,'28815'
      ,'Smallville'
      ,3
    )

I have done this before and it's no problem comparing all the rows without further restrictions. Today, however, I only want to compare rows that have different values inside the CustomerId column. So regarding my sample data I'm not interested in any similarities between line 1 and 2 and neither in any similarities between line 3 and 4. What I do want to find out is:

What are the chances of line 3 being a duplicate (or "close relative") of line 1 or line 5?
What are the chances of line 4 being a duplicate (or "close relative") of line 1 or line 5?
What are the chances of line 5 being a duplicate (or "close relative") of line 1 or line 3?

Is there a way of integrating the "different CustomerId value" requirement within the Fuzzy Grouping component?

Thanks in advance.

Mike Honey · Answer 1 · 2014-04-09T22:47:55.573

1

AFAIK the component doesnt support this.

Instead I would use a Fuzzy Lookup transformation.

On the Advanced page I would increase Maximum number of matches to output per lookup and decrease the Similarity threshold, then dump the result into a SQL table, then query the output table with a WHERE clause that excludes CustomerID matches.

edited Apr 09 '14 at 22:47

answered Apr 09 '14 at 06:52

Mike Honey

14,523
1
24
40

Thanks. Within the Advanced page of the Fuzzy _Grouping_ component I can't see a number of matches to output per lookup. This seems to be a Fuzzy _Lookup_ component thing only. But I might be able to work with the Lookup component; I'll give it a try and come back here afterwards. – Onkel Toob Apr 09 '14 at 10:02

SSIS: Fuzzy Grouping only for specific rows

1 Answers1