The company I work for purchased data cleansing and matching software to cleanse and match information every night. It takes about fifteen hours to run.

I have discovered the Fuzzy Grouping and Fuzzy Lookup components in SSIS, which in my experience are extremely fast by comparison. I have some questions:

What algorithms do these components use? I have read articles suggesting that they use Soundex, variations of Soundex, q-grams, Levenshtein distance, or some combination of the four. Is there any documentation that explicitly specifies which algorithms they use?
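
For reference, here is a minimal sketch of two of the techniques named above, Levenshtein distance and q-gram similarity. It only illustrates the general algorithms and makes no claim about what the SSIS components do internally. The function names (`levenshtein`, `qgrams`, `qgram_similarity`), the `#` padding character, and the choice of Jaccard overlap are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def qgrams(s: str, q: int = 3) -> set:
    """Set of overlapping substrings of length q, after lowercasing
    and padding both ends with '#' (an illustrative convention)."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Jaccard overlap of the two q-gram sets, between 0.0 and 1.0."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


print(levenshtein("kitten", "sitting"))   # 3
print(qgram_similarity("Smith", "Smyth"))
```

Edit distance compares strings character by character, while q-gram overlap tolerates reordered or partially matching substrings, which is one reason the two are often mentioned together in fuzzy matching discussions.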

w0051977

1 Answer

This page from Microsoft Research describes these components at a high level: http://research.microsoft.com/en-us/projects/datacleaning/

I think the second-to-last link on that page has a full description: http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

It's way over my head, but it reads as though they developed their own algorithm.

Mike Honey