I have a list of properly formatted company names, and I am trying to find when those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa, or American Airlines Group Inc may appear as American Airlines.
How would I go about iterating over the entire contents of the document and returning the properly formatted company name whenever a close match is found?
I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is that each attempt looks at individual words rather than clusters of words:
from fuzzywuzzy import process
from difflib import get_close_matches
company_name = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']
text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'
# using fuzzywuzzy
for word in text.split():
    print('- ' + word + ', ', ', '.join(map(str, process.extractOne(word, company_name))))
# using get_close_matches
for word in text.split():
    match = get_close_matches(word, company_name, n=1, cutoff=.4)
    print(match)
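
To make "clusters of words" concrete, here is a minimal sketch of that kind of comparison, using only difflib.SequenceMatcher from the standard library. It slides a window of consecutive words over the text and scores each window against each company name. The function name find_companies, the \w+ tokenization, and the 0.75 cutoff are all assumptions made for illustration, not an established recipe, and the cutoff in particular would need tuning on real documents.

```python
import re
from difflib import SequenceMatcher

company_names = ['American Tower Inc', 'American Airlines Group Inc',
                 'Atlantic American Corp', 'American International Group']
text = ('American Tower is one company. American Airlines is another while '
        'there is also Atlantic American Corp but we cannot forget about '
        'American International Group Inc.')


def find_companies(text, names, cutoff=0.75):
    """Hypothetical helper: return (word_index, name, score) for every name
    whose best-matching run of consecutive words scores at least `cutoff`."""
    words = re.findall(r"\w+", text)  # drop punctuation, keep word order
    hits = []
    for name in names:
        target = name.lower()
        max_size = len(name.split())
        best_score, best_start = 0.0, None
        # Compare every run of 1..max_size consecutive words with the name.
        for size in range(1, max_size + 1):
            for start in range(len(words) - size + 1):
                phrase = " ".join(words[start:start + size]).lower()
                score = SequenceMatcher(None, phrase, target).ratio()
                if score > best_score:
                    best_score, best_start = score, start
        if best_start is not None and best_score >= cutoff:
            hits.append((best_start, name, round(best_score, 2)))
    return sorted(hits)  # ordered by position in the text


for start, name, score in find_companies(text, company_names):
    print(start, name, score)
```

This compares every window against every name, so it is slow on long documents and long lists, and a window can include a neighbouring filler word if that happens to raise the score. It is only meant to show the shape of the result I am after: the properly formatted name together with where it roughly occurs.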