0

To make things simple lets say I have a Client table with Fields:

  • ClientID
  • PCode
  • Region

I have a look up Region table with Fields:

  • ID
  • PostCode
  • Region

The Client table has one row :

123, 3075, THOMASTOWN

The Region has 2 rows:

1,3074,THOMASTOWN
2,3075,LALOR

I am trying to cleanse some data using SSIS fuzzy lookup.

I use the Client as source and the lookup as Reference. When the Max number of matches to output per lookup is 1 the result is 123,3075,THOMASTOWN, 1,3074,THOMASTOWN. In this case SSIS prefers the value of region over the value of pcode. But when I increase this option to a higher (2, 4, or 100) all the result rows are the same as the previous one.

I expect when I increase the number the other row of the lookup table shows up as one of the matches because the row has the same region code as the Client row.

To my surprise when for the first time I increased the option from 2 to 4 I saw the expected result (the other lookup record as a match) in 2 out of 4 rows of the fuzzy lookup output but since then it has never happened again and I always see the exactly same records as per different values of Max number of matches option.

Can anyone explain to me what's happening here and if I am doing something wrong?

billinkc
  • 59,250
  • 9
  • 102
  • 159
Jami
  • 579
  • 6
  • 20

1 Answers1

0

From BOL: http://msdn.microsoft.com/en-us/library/ms137786.aspx

The transformation returns zero or more matches up to the number of matches specified. 
Specifying a maximum number of matches does not guarantee that the transformation 
returns the maximum number of matches; it only guarantees that the transformation 
returns at most that number of matches. 

If you are using these two columns in your matching, the region seems to be different enough that the pcode is still not a strong enough match on its own to return any results.

There are a couple of options:

  1. Try tuning the similarity threshold and see if you can get the row returned
  2. Use two fuzzy lookups (one for each column), so that if the first does not match or the similarity % is not strong enough, you can perform a second lookup to make the match.
Mark Wojciechowicz
  • 4,287
  • 1
  • 17
  • 25
  • What surprises me here is the result the process generates for just the fist time. Today I run the package in another machine and got the expected result similar to the first time I run the package in another machine as I highlighted in my question. To me it shows that the under-the-hood algorithm here acts in a bit non-deterministic way. Although the end result of the process are fuzzy but it does not mean that the process itself should be non-deterministic as well I suppose. – Jami Sep 16 '14 at 01:38
  • I have found the problem. I use SQL Destination for storing my original lookup access table. I didn't know that SSIS does not remove all the rows from the table upon each run. Now it seems to work fine. But I have noticed a minor problem. When I set the max number of matches to a big number like 100 the number of matched row result reduces to 1 record which does not have a high similarity value! But when I set it to 2 the result is 2 records which is acceptable. – Jami Sep 17 '14 at 05:42