2

Using Excel 2010, and the Microsoft "Fuzzy Lookup" add in to compare a column out of 2 worksheets. First worksheet has around 48,000 rows (x 3 columns), second worksheet has around 23,000 rows (x 5 columns). The "Fuzzy Lookup" is comparing one column from each & returning a similarity between the two.

The fuzzy lookup appears to run without a problem, and the results - in most cases - appear to be correct. For example:

W2-NK22/16 in one worksheet shows to have a 0.97 similarity to W2NK2216.

But not in all cases. Some that I expected to have some degree of similarity, instead have 0.000 returned by the Add-In. For example:

761689700000

should have some degree of similarity to:

761689700000EN4239

but the Fuzzy Lookup add in returns 0.000 for it. Both fields are formatted as text. Neither have spaces before or after them, and the first 12 characters are identical.

I have uninstalled & reinstalled the add-in, and have used the default settings. The only other Fuzzy Lookup settings I have changed were in Configure --> Global -- UseApproximateIndexing. I have set it to both False and True which have had no impact.

I have hundreds of examples like the one above that show 0.000 similarity, but upon inspection appear to be very similar. Rows before & after them show various degrees of similarity.

Any thoughts or ideas as to why this doesn't appear to function correctly, or a better way to do this approximate match would be appreciated.

Community
  • 1
  • 1
steve_o
  • 1,243
  • 6
  • 34
  • 60
  • I wonder if it considers each entry one token rather than each letter a token. Change `761689700000EN4239` to `761689700000 EN4239` with a space and run it again. If it thinks they're a match now, it's a token problem. Which is nice to know, but I don't know how to fix it. – Dick Kusleika Oct 12 '15 at 18:20
  • @DickKusleika - I was just reading your (very good) article on this from Jan 2015 - one of the more helpful ones I've found. I did as you suggested (chg ...700000EN4239 to ....700000 EN4239) - and it now returns the correct entry, with a 0.90 similarity. If nothing else, at least I understand why the same 12 characters in each were not a match. It doesn't make a lot of sense to me why each entry is a token rather than each character, since it's determining similarity. I would guess the overhead would be tremendous to make it work by character. – steve_o Oct 12 '15 at 19:48
  • I seems like a good idea to tokenize character-by-character if there are no spaces or other special characters. With all the options that add-in has, I wonder if it has one that will force that situation. Unfortunately, we've reached the limits of my knowledge on this one. None of the settings that I saw looked promising. – Dick Kusleika Oct 12 '15 at 21:11
  • @DickKusleika - Do you have any idea what the Fuzzy Lookup tool uses to determine what a token is? Obviously a space (from your test) - I've got all kinds of characters I which I would presume to indicate a new token (dashes, slashes, periods, commas, etc). I have a UDF in Excel that I've used to put spaces between letters & numbers (the above would be 3 tokens now). You would think that anything not 0-9,A-Z,a-z would be one. Any thoughts? Thanks again for the help! – steve_o Oct 16 '15 at 15:27
  • According to https://msdn.microsoft.com/en-us/library/ms137786.aspx `The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data.` but I don't know how or where you do it. See also http://blogs.adatis.co.uk/blogs/jeremykashel/archive/2013/07/05/ssis-fuzzy-lookup-transformation-tokens.aspx – Dick Kusleika Oct 19 '15 at 22:21
  • @DickKusleika - An update. I did some things to split the strings out, by delimiting them (with a space) where I had numbers & letters adjacent to each other. In other words, 700000EN4239 became 700000 EN 4239. The resulting fuzzy lookup nearly doubled the amount of non-zero matches. The down side to that was that there were a LOT more false matches, and some matches from earlier with a 0.90+ similarity where now about 0.80. The number of 1.0 matches were exactly the same in both. We ended up using the expanded list simply to check obvious mismatches from the earlier one. Thx for the help! – steve_o Nov 04 '15 at 14:52

1 Answers1

1

Trying to add content even though this case is 2 years old. Hopefully someone else can use it.

For Transformations, Tokenization, etc - look in the same folder where Fuzzy Lookup is installed. There is an example file there called Portfolio.xlsx and a corresponding Readme.docx file. Those are very helpful. Frankly the documentation on the Fuzzy Lookup add-in is terrible (but it is free). The Readme file talks about an entitlement called "EditTransformationProvider" that might help this kind of problem.

I've implemented Fuzzy on a couple processes at my work and we have saved hundreds of man-hours when working in Excel. It's no joke.