1

I have 12 million company names in my database, and I want to match them offline against a list. What is the best algorithm for this? I tried Levenshtein distance, but it is not giving the expected results. Could you please suggest some alternatives? The problem is matching companies like these:

  • G corp. -> G corporation
  • water Inc -> Water Incorporated
shashank

3 Answers

2

You should probably start by expanding the known suffixes in both lists (the database and the list). This will take some manual work to figure out the correct mapping, e.g. with regexps:

  • \s+inc\.?$ -> Incorporated
  • \s+corp\.?$ -> Corporation

You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.

You can then use Levenshtein distance or another fuzzy matching algorithm.
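
The steps above can be sketched roughly as follows. This is only an illustration, assuming Python (as used elsewhere in this thread): the `SUFFIXES` table and the `normalize` and `similarity` helpers are made-up names, the table would need to be extended by hand from your own data, and the standard library's `difflib.SequenceMatcher` stands in for Levenshtein distance so the example has no third-party dependencies.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix table (an assumption): extend it manually from your data.
SUFFIXES = [
    (re.compile(r"\s+inc\.?$"), " incorporated"),
    (re.compile(r"\s+corp\.?$"), " corporation"),
]


def normalize(name: str) -> str:
    """Lower-case, expand known suffixes, drop punctuation, collapse spaces."""
    name = name.lower().strip()
    for pattern, replacement in SUFFIXES:
        name = pattern.sub(replacement, name)
    name = re.sub(r"[^\w\s]", "", name)       # remove punctuation
    return re.sub(r"\s+", " ", name).strip()  # collapse whitespace


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] computed on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    print(normalize("G corp."))
    print(similarity("water Inc", "Water Incorporated"))
```

Normalizing both sides first means the fuzzy matcher only has to absorb genuine typos, not abbreviation differences.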

AKX
  • There are hundreds of such normalizations, so I am expecting some algorithm that can deal with such cases. – shashank Aug 20 '18 at 12:36
  • 1
    @Shashank: The fact that "corp." means "Corporation" is a human construct, not a logical one. Thus, the *only* algorithm to detect this is by manually telling it. Of course, there may be better or worse ways to do this thing, but anything remotely functional will be some play on this solution. – Him Aug 20 '18 at 12:49
2

You can use fuzzyset: put all your company names in the fuzzy set, then look up a new term to get matching scores. An example:

import fuzzyset

fz = fuzzyset.FuzzySet()
# Build the set of terms we want to match against in a fuzzy way
for name in ["Diane Abbott", "Boris Johnson"]:
    fz.add(name)

# Now see whether each sample term fuzzy-matches any of the stored terms
sample_term = 'Boris Johnstone'
print(fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley'))

Also, if you want to work with semantics instead of just the string (which works better in such scenarios), have a look at spaCy similarity. An example from the spaCy docs:

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
Deepak Saini
  • Fuzzyset is taking hours to load the 12 million companies. Is there any other package that might help? – shashank Aug 25 '18 at 12:44
-1

Use MatchKraft to fuzzy match company names on two lists.

http://www.matchkraft.com/

Levenshtein distance alone is not enough to solve this problem. You also need the following:

  1. Heuristics to improve execution time
  2. Information retrieval (Lucene) and SQL
  3. Company names database
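
To give a rough idea of what the first two points can look like, here is a minimal sketch of one common heuristic: an inverted index on character trigrams that narrows each query to a few candidates before any expensive string comparison. It is an assumption for illustration only, not how any particular tool works. The names `trigrams`, `build_index`, `best_match`, and `max_candidates` are invented here, and `difflib.SequenceMatcher` stands in for Levenshtein distance.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def trigrams(s: str) -> set:
    """Character trigrams of the lower-cased, space-padded string."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(names):
    """Map each trigram to the positions of the names that contain it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for gram in trigrams(name):
            index[gram].add(i)
    return index


def best_match(query, names, index, max_candidates=50):
    """Score only the names sharing the most trigrams with the query."""
    counts = Counter()
    for gram in trigrams(query):
        for i in index.get(gram, ()):
            counts[i] += 1
    best, best_score = None, 0.0
    for i, _ in counts.most_common(max_candidates):
        score = SequenceMatcher(None, query.lower(), names[i].lower()).ratio()
        if score > best_score:
            best, best_score = names[i], score
    return best, best_score


if __name__ == "__main__":
    names = ["G Corporation", "Water Incorporated", "Acme Ltd"]
    index = build_index(names)
    print(best_match("Water Incorporatd", names, index))
```

The index is built once over the large list; each lookup then compares the query against at most `max_candidates` names instead of all of them.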

It is better to use an existing tool rather than creating your program in Python.