
I've got a list of company names that I'm trying to find in a large number of PDF documents.

I've forced the PDFs through Apache Tika to extract the raw text, and I've got the list of 200 companies read in.

I'm stuck trying to use some combination of FuzzyWuzzy and spaCy to extract the required matches.

This is as far as I've gotten:

import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")
doc = nlp(strings[1])  # strings holds the Tika-extracted text, one entry per PDF

companies = []    # the 200 known company names
candidates = []   # ORG entities found in the document

for ent in doc.ents:
    if ent.label_ == "ORG":
        candidates.append(ent.text)

# company_name is meant to be a single name from the companies list
process.extractBests(company_name, candidates, score_cutoff=80)

What I'm trying to do is:

  1. Read through the document string
  2. Parse for any fuzzy company name matches scoring say 80+
  3. Return company names that are contained in the document and their scores.

Help!

Jack McPherson
    Have you seen this thread? https://support.prodi.gy/t/fuzzy-partial-matching-with-phrasematcher-ner-task/1084/8 There are a few posts with GitHub links to working code combining these two libraries. – APhillips Jan 29 '20 at 02:10
  • Hey mate, I did see that one but I bounced off it. Am I missing something in how this can be used to solve this? – Jack McPherson Jan 29 '20 at 02:43

1 Answer


This is how I populated candidates -- mpg is a pandas DataFrame with the raw strings in its name column:

for s in mpg['name'].values: 
    doc = nlp(s) 
    for ent in doc.ents: 
        if ent.label_ == 'ORG': 
            candidates.append(ent.text) 
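
If you want to run that loop end to end, something like this could stand in for mpg (the column name and the sample values here are just placeholders for illustration):

import pandas as pd

# hypothetical stand-in for the mpg DataFrame used above
mpg = pd.DataFrame({'name': ['buick skylark 320',
                             'buick estate wagon (sw)',
                             'chevrolet chevelle malibu']})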

Then let's say we have a short list of car data just to test with:

candidates = ['buick',
              'buick skylark',
              'buick estate wagon',
              'buick century']

The method below uses fuzz.token_sort_ratio, which is described as "returning a measure of the sequences' similarity between 0 and 100, but sorting the tokens before comparing." Try some of the other scorers, which are partially documented here: https://github.com/seatgeek/fuzzywuzzy/issues/137

results = {}  # dictionary to store results
companies = ['buick']  # you'll have more companies
for company in companies:
    results[company] = process.extractBests(company, candidates,
                                            scorer=fuzz.token_sort_ratio,
                                            score_cutoff=50)

And the results are:

In [53]: results                                                                
Out[53]: {'buick': [('buick', 100), 
                    ('buick skylark', 56), 
                    ('buick century', 56)]}

In this case using 80 as a cutoff score would work better than 50.
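
Putting the pieces together for the original question (scan one document's text, keep the ORG entities, and fuzzy-match them against the full company list at a cutoff of 80), a minimal sketch could look like the following. Names like document_text and company_names are placeholders for your own variables, and fuzz.partial_ratio could be swapped in if you want 'buick' inside 'buick skylark' to count as a full match:

import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")

def match_companies(document_text, company_names, cutoff=80):
    # collect the ORG entities spaCy finds in this document as match candidates
    doc = nlp(document_text)
    candidates = [ent.text for ent in doc.ents if ent.label_ == "ORG"]

    # for each known company, keep the candidates scoring at or above the cutoff
    results = {}
    for company in company_names:
        matches = process.extractBests(company, candidates,
                                       scorer=fuzz.token_sort_ratio,
                                       score_cutoff=cutoff)
        if matches:
            results[company] = matches
    return results

Calling match_companies(strings[1], your_list_of_200_names) then returns a dict mapping each company name to the (matched text, score) pairs found in that document.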

mechanical_meat