I have a list of properly formatted company names, and I am trying to find where those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa, and American Airlines Group Inc may appear as American Airlines.

How would I go about iterating over the entire contents of the document and returning the properly formatted company name whenever a close match is found?

I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is that my code compares each individual word against the list rather than clusters of words:

from fuzzywuzzy import process
from difflib import get_close_matches

company_name = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']

text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'

# Using fuzzywuzzy: score each individual word against the whole list
for word in text.split():
    match, score = process.extractOne(word, company_name)
    print(f'- {word}, {match}, {score}')

# Using get_close_matches: again one word at a time
for word in text.split():
    match = get_close_matches(word, company_name, n=1, cutoff=0.4)
    print(match)

user53526356
  • Which part of the company name is optional, that is, can be missing from the text and still count as a match? Would it be right to say that `Inc`, `Group Inc`, or `Corp` is optional and the company name should match with or without it? So if the text contains `American International`, I guess you would be okay matching it, but matching `American` alone would not be okay because the context would be too broad. Can you clarify this? – Pushpesh Kumar Rajwanshi May 15 '19 at 18:40
  • Yes, that's correct: Inc, Corp, etc. can probably be ignored when they stand alone. But `Incyte Corp` should still be matched, even though its name contains `Inc`. Also, all company names will be capitalized, so I think the solution would likely have to use some form of regex? – user53526356 May 15 '19 at 19:59

2 Answers

I was working on a similar problem. By default, fuzzywuzzy uses difflib internally, and both perform slowly on large datasets.

Chris van den Berg's pipeline splits company names into character 3-grams, weights them with a TF-IDF matrix, and then compares the resulting vectors using cosine similarity.

The pipeline is quick and gives accurate results for partially matched strings too.
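
To make the idea concrete, here is a minimal pure-Python sketch of character 3-gram TF-IDF vectors compared with cosine similarity. It is not the pipeline's actual code: the helper names (`char_ngrams`, `build_idf`, `tfidf_vector`, `best_match`), the space padding, and the smoothed IDF formula are all assumptions made for illustration, and a real implementation would use sparse matrices for speed.

```python
import math
from collections import Counter

def char_ngrams(s, n=3):
    """Character n-grams of a lowercased string padded with one space on each side."""
    s = f" {s.lower()} "
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_idf(names, n=3):
    """Smoothed inverse document frequency of every n-gram seen in the names."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name, n)))
    total = len(names)
    return {g: math.log((1 + total) / (1 + c)) + 1 for g, c in doc_freq.items()}

def tfidf_vector(s, idf, n=3):
    """L2-normalised TF-IDF vector, stored as a dict; unseen n-grams are dropped."""
    counts = Counter(g for g in char_ngrams(s, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())

def best_match(candidate, names, idf, name_vectors):
    """Return (closest official name, similarity score) for one candidate string."""
    cand_vec = tfidf_vector(candidate, idf)
    scores = [cosine(cand_vec, v) for v in name_vectors]
    best = max(range(len(names)), key=scores.__getitem__)
    return names[best], scores[best]

company_name = ['American Tower Inc', 'American Airlines Group Inc',
                'Atlantic American Corp', 'American International Group']
idf = build_idf(company_name)
name_vectors = [tfidf_vector(name, idf) for name in company_name]

for candidate in ['American Airlines', 'American Tower', 'company']:
    print(candidate, '->', best_match(candidate, company_name, idf, name_vectors))
```

Because 3-grams shared by every name (such as `ame` or `ric`) get a low weight, `American Airlines` ends up closest to `American Airlines Group Inc` rather than to the other names containing "American", while an unrelated word such as `company` only gets a low score that a threshold can filter out.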

Viseshini Reddy
  • Thanks, I had come across this page a few days ago during my search but it's admittedly above my head. It seems promising, but looks like it's comparing one list to another list. So how would I achieve looping through an entire text block/file and try to find all the company names that exist on the existing list? – user53526356 May 17 '19 at 14:49
  • You just have to understand how the TF-IDF matrix is calculated. Can you extract all the noun phrases from the document? Even if you extract words like `company`, the pipeline will give them a low score when you compare them with the `company_name` list. – Viseshini Reddy May 20 '19 at 06:23
  • Yeah, just extracting title-case words gets me somewhat close, and then I was hoping to filter those by whether there is a close match to `company_name` above a certain threshold. But I still have the same problem: some company names are one word (e.g., Visa) whereas others are multiple words (e.g., American Airlines vs American Tower). In the latter case, I'm stuck on how to find the closest match in the `company_name` list. – user53526356 May 20 '19 at 13:48
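
One way the title-case idea from the comments might be sketched: pull out runs of consecutive capitalized words as whole clusters, strip optional legal suffixes from both the cluster and the listed names, and only then look for a close match. The suffix list, the `normalize` and `find_companies` helpers, and the 0.8 cutoff are assumptions chosen for illustration, not a tested recommendation.

```python
import re
from difflib import get_close_matches

# Runs of consecutive capitalized words, e.g. "American Airlines"
CLUSTER = re.compile(r'[A-Z][\w&-]*(?:\s+[A-Z][\w&-]*)*')

# Trailing legal suffixes treated as optional (assumed list, extend as needed).
# The leading \s+ and the \b keep "Incyte" intact even though it starts with "Inc".
SUFFIX = re.compile(r'(?:\s+(?:Inc|Corp|Group|Co|Ltd|LLC)\b\.?)+$', re.IGNORECASE)

def normalize(name):
    """Drop trailing suffixes such as ' Group Inc' and lowercase the rest."""
    return SUFFIX.sub('', name).strip().lower()

def find_companies(text, names, cutoff=0.8):
    """Return (text as found, official name) pairs for clusters that match closely."""
    lookup = {normalize(name): name for name in names}
    found = []
    for cluster in CLUSTER.findall(text):
        hit = get_close_matches(normalize(cluster), list(lookup), n=1, cutoff=cutoff)
        if hit:
            found.append((cluster, lookup[hit[0]]))
    return found

company_name = ['American Tower Inc', 'American Airlines Group Inc',
                'Atlantic American Corp', 'American International Group']

text = ('American Tower is one company. American Airlines is another while '
        'there is also Atlantic American Corp but we cannot forget about '
        'American International Group Inc.')

matches = find_companies(text, company_name)
for found, official in matches:
    print(f'{found} -> {official}')
```

Comparing whole clusters instead of single words means a one-word name like Visa and a multi-word name like American Airlines go through the same code path. Capitalized words that merely start a sentence are also picked up as clusters, so the cutoff is what keeps them out of the results.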

For this type of task I use a record linkage algorithm, which will find those clusters for you with the help of machine learning. You will have to label some actual examples so the algorithm can learn to label the rest of your dataset properly.

Here is some info: https://pypi.org/project/pandas-dedupe/

Cheers,

Shogun187