3

We're doing a LOT of work towards trying to reconcile about 1,000 duplicate manufacturer names and 1,000,000 duplicate part numbers. One thing that has come up is how to "match" things like "Limited" vs. "Ltd." vs. "Ltd"

The purpose is for the application to reconcile these matched items into a standard format. So:

ACME Ltd. ACME Limited ACME Ltd

Should all be reconciled into ACME Ltd.

This will also be used to prevent entering additional duplicates in the future.

Any suggestions on how to accomplish this pattern matching in SQL Server? Any known algorithms to find items with mapped equivalencies, etc...?

Thanks!

Eric.

bopapa_1979
  • 8,949
  • 10
  • 51
  • 76
  • 1
    I don't know about the part numbers (because you haven't given any examples), but for the company names, I'd imagine you'd solve the problem more quickly and more reliably by simply printing out a sorted list, and doing it by hand... – Oliver Charlesworth Jan 19 '11 at 23:45
  • How are your part numbers different? – Abe Miessler Jan 19 '11 at 23:54
  • @Oli Charlesworth: good point, but the problem will reoccur when someone enters the wrong name. Augment with regex match and replace in the input field. – Fred Foo Jan 19 '11 at 23:59

2 Answers2

3

How about a table that lists what you want in one column and variations in the next?

Ltd   Limited 
Ltd   Ltd.
St    Street
St    Str.

Then, if you find a match on the second column, you change it to the first. It may take several iterations, as you find other alternatives.

thursdaysgeek
  • 7,696
  • 20
  • 78
  • 115
  • This wound up working really well with a single select statement inside a function: SELECT @InputString = REPLACE(@InputString, c.Original, c.Standard) FROM StdCorpNames c. There are some other complexities, like operating only on whole words, assuming things like Inc or Ltd are at the end of the string, etc... but this is a really nice basic solution. Thanks! – bopapa_1979 Jan 20 '11 at 17:48
2

Using SQL Server Full Text Search you can use synonyms:

For each full-text language, SQL Server also provides a file in which you can optionally define language-specific synonyms to extend the scope of search queries (a thesaurus file).

In your case you could add a section like the following:

 <expansion>
         <sub>Limited</sub>
         <sub>Ltd</sub>
         <sub>Ltd.</sub>
 </expansion>

Here is a link that goes into more detail on how to modify the thesaurus file. This may work for what you are trying to do...

SQL Server also offers some limited pattern matching by using LIKE. I would recommend looking over the options it offers to see if they will be sufficient for your needs.

If LIKE is insufficient you can always look at creating a CLR stored procedure or UDFs that will allow you to use regular expressions. This will allow you to match MUCH more complex patters...

Abe Miessler
  • 82,532
  • 99
  • 305
  • 486