2

I'm keeping this language agnostic, as I'm open to any platform that can provide a solution. My current implementation is in Excel/VBA, but I'm investigating Python, JavaScript and SSMS.

Are there any existing methods for leveraging a collection of known string variants (ex. abbreviations, acronyms)? These are equivalents that SoundEx or edit distance are unlikely to catch as significantly close, but have been predefined and collected in table form.

Assuming the fields are specific enough to avoid false equivalencies (ex. defining 'US' in a field for Country), is there a way to take these into consideration with fuzzy matching?

The closest solution I've been able to implement is a RegEx match-substitution with an index key.

For example:

DEPARTMENT OF | DEPT. OF | DEPARTMENT | DPT. OF | DEPT. | DEPT | DPT
MEDICINE | MED. | MED

With the record set of

RECORD 1 | DEPARTMENT OF MED.
RECORD 2 | DPT. MEDICINE
RECORD 3 | DEPT OF MED

And the pseudo-code

FIND '(\b(DEPARTMENT\sOF|DEPARTMENT|DEPT.\sOF|DEPT\.|DEPT\sOF|DPT\.\sOF|DPT\.|DPT)(?=\b))'
      REPLACE '$matchSet001'
FIND '(\b(MEDICINE|MED\.|MED)(?=\b))'
      REPLACE '$matchSet002'

The results would be line up as identical records (as intended)

RECORD 1 | $matchSet001 $matchSet002
RECORD 2 | $matchSet001 $matchSet002
RECORD 3 | $matchSet001 $matchSet002

This has worked, but has serious limitations. In any larger scale (1000+ equivalents) this is prohibitively time consuming, as it requires manually implementing each variation with an eye to ordering. It also transforms the string and prevents actual fuzzy matching in subsequent steps.

Edit: This paper appears to address the same issue in specific context (researcher names) but I couldn't find any working implementations.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
ghostrobot
  • 43
  • 4

0 Answers0