I am trying to measure the similarity of company names, but I am having difficulty matching abbreviations to the full names. For example:

IBM
The International Business Machines Corporation

I have tried using fuzzywuzzy to measure the similarity:

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio("IBM","The International Business Machines Corporation")
33
>>> fuzz.partial_ratio("General Electric","GE Company")
20
>>> fuzz.partial_ratio("LTCG Holdings Corp","Long Term Care Group Inc")
39
>>> fuzz.partial_ratio("Young Innovations Inc","YI LLC")
33

Do you know of any techniques that would yield a higher similarity score for such abbreviations?

  • My suggestion (based on your examples above) would be to reduce each query to a string made of the first letter of each word, then perform your matching using those strings. Will post an answer, stay tuned. – rahlf23 Jul 18 '18 at 15:44
  • Question, is your LTCG Holdings Corp and Long Term Care Group Inc. example reversed by chance? – rahlf23 Jul 18 '18 at 15:46
  • @rahlf23 I do not know whether the incoming data is an abbreviation or not. I am doing this for a large amount of data, so I do not know the order or the type of each name. – Charlotte von La Roche Jul 18 '18 at 15:51
  • See my answer. You could also think about removing words like 'The', 'Corporation', 'Inc', etc. in a pre-processing step to improve the accuracy of the abbreviations you generate. – rahlf23 Jul 18 '18 at 16:03
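
The pre-processing step suggested in the last comment could look roughly like the sketch below. It uses only the standard library. The `STOP_WORDS` set and the `acronym` helper are hypothetical names introduced for illustration, and the word list is an assumption that would need tuning for real data.

```python
import re

# Hypothetical stop-word list (an assumption, not from the thread);
# extend it with whatever legal suffixes appear in your data.
STOP_WORDS = {"the", "corporation", "corp", "company", "inc", "llc", "holdings"}


def acronym(name, stop_words=STOP_WORDS):
    """Return the upper-cased initials of the words in `name`, skipping stop words."""
    # Split on anything that is not a letter or digit, so "Inc." becomes "Inc".
    words = re.findall(r"[A-Za-z0-9]+", name)
    kept = [w for w in words if w.lower() not in stop_words]
    return "".join(w[0].upper() for w in kept)


print(acronym("The International Business Machines Corporation"))  # IBM
print(acronym("Long Term Care Group Inc"))  # LTCG
print(acronym("Young Innovations Inc"))  # YI
```

With the filler words removed, the generated acronym no longer carries extra letters such as the leading "T" and trailing "C" from "The ... Corporation".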

1 Answer

Reducing each full company name to its initials before matching seems to produce much better results for the set of examples above:

from fuzzywuzzy import fuzz, process

companies = ['The International Business Machines Corporation','General Electric','Long Term Care Group','Young Innovations Inc']
abbreviations = ['YI LLC','LTCG Holdings Corp','IBM','GE Company']

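# Build one query per company from the first letter of each word in its name
queries = [''.join(word[0] for word in name.split()) for name in companies]

# Score every abbreviation against each generated query
for query in queries:
    print(query, process.extract(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))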

This yields:

TIBMC [('IBM', 100), ('LTCG Holdings Corp', 50), ('YI LLC', 29), ('GE Company', 20)]
GE [('GE Company', 100), ('LTCG Holdings Corp', 50), ('YI LLC', 0), ('IBM', 0)]
LTCG [('LTCG Holdings Corp', 100), ('YI LLC', 50), ('GE Company', 25), ('IBM', 0)]
YII [('YI LLC', 80), ('LTCG Holdings Corp', 33), ('IBM', 33), ('GE Company', 33)]

A small modification to the for loop pairs each company with its single best match:

for query, company in zip(queries, companies):
    print(company, '-', process.extractOne(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))

Gives:

The International Business Machines Corporation - ('IBM', 100)
General Electric - ('GE Company', 100)
Long Term Care Group - ('LTCG Holdings Corp', 100)
Young Innovations Inc - ('YI LLC', 80)
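
If you do not know in advance which side of a pair is the abbreviation, one option is to score both directions and keep the best. The following is only a sketch under stated assumptions: it swaps in `difflib.SequenceMatcher` from the standard library in place of a fuzzywuzzy scorer, and `STOP_WORDS`, `significant_words`, `acronym`, `compact` and `abbreviation_score` are hypothetical names introduced here for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop-word list (an assumption); tune it for your own data.
STOP_WORDS = {"the", "corporation", "corp", "company", "inc", "llc", "holdings"}


def significant_words(name):
    """Split a name into alphanumeric words and drop the stop words."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return [w for w in words if w.lower() not in STOP_WORDS]


def acronym(name):
    """Initials of the significant words, e.g. 'General Electric' -> 'GE'."""
    return "".join(w[0].upper() for w in significant_words(name))


def compact(name):
    """Significant words joined without spaces, e.g. 'GE Company' -> 'GE'."""
    return "".join(significant_words(name)).upper()


def ratio(a, b):
    """Similarity on a 0-100 scale; empty strings never match."""
    if not a or not b:
        return 0
    return round(100 * SequenceMatcher(None, a, b).ratio())


def abbreviation_score(a, b):
    """Best score over both abbreviation directions and a plain comparison."""
    return max(
        ratio(acronym(a), compact(b)),  # a is the full name, b the abbreviation
        ratio(compact(a), acronym(b)),  # b is the full name, a the abbreviation
        ratio(compact(a), compact(b)),  # both are written the same way
    )


pairs = [
    ("IBM", "The International Business Machines Corporation"),
    ("General Electric", "GE Company"),
    ("LTCG Holdings Corp", "Long Term Care Group Inc"),
    ("Young Innovations Inc", "YI LLC"),
]
for a, b in pairs:
    print(a, "|", b, "->", abbreviation_score(a, b))
```

All four example pairs score 100 with this stop-word list, regardless of which side holds the abbreviation. Short acronyms collide easily on a large dataset, so it is worth checking the high-scoring pairs by hand.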
rahlf23