Algorithm used by Fuzzy Group

Question

The company I work for purchased data cleansing and matching software that cleanses and matches our records every night. Each run takes about fifteen hours.

I have discovered the Fuzzy Grouping/Fuzzy Lookup components in SSIS, which in my experience are extremely fast by comparison. I have some questions:

What algorithms do these components use? I have read articles suggesting that they use Soundex, variations of Soundex, q-grams, and Levenshtein distance, or a combination of the four. Is there any documentation that explicitly specifies which algorithms they use?

score 0 · Answer 1 · answered Apr 29 '15 at 01:07

This page from Microsoft Research describes these at a high level: http://research.microsoft.com/en-us/projects/datacleaning/

I think the 2nd-last link has a full description: http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

It's way over my head, but it reads like they rolled their own algo.
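
For anyone unfamiliar with the terms in the question, here is a minimal Python sketch of two of the measures it mentions: Levenshtein distance and q-gram similarity. This only illustrates those general techniques; it is not the SSIS implementation, and the function names, the `#` padding character, and the use of Jaccard overlap to compare q-gram sets are my own assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def qgrams(s: str, q: int = 3) -> set:
    """Set of overlapping substrings of length q, after lowercasing
    and padding both ends with '#' (an assumed convention)."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Jaccard overlap of the two q-gram sets, between 0.0 and 1.0."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty inputs are treated as identical
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    print(levenshtein("Jonathan Smith", "Jonathon Smyth"))
    print(qgram_similarity("Jonathan Smith", "Jonathon Smyth"))
```

Edit distance is exact but has to be computed pair by pair, whereas q-gram sets can be indexed, so a matcher can use them to narrow down candidates before doing the more expensive comparison.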
