
I've got a list of company names that I'm trying to find in a large number of PDF documents.

I've forced the PDFs through Apache Tika to extract the raw text, and I've got the list of 200 companies read in.

I'm stuck trying to use some combination of FuzzyWuzzy and spaCy to extract the required matches.

This is as far as I've gotten:

import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")
doc = nlp(strings[1])  # strings holds the Tika-extracted text, one entry per PDF

companies = []    # the 200 known company names
candidates = []   # ORG entities found in the document

for ent in doc.ents:
    if ent.label_ == "ORG":
        candidates.append(ent.text)

# company_name is meant to be a single name from the companies list
process.extractBests(company_name, candidates, score_cutoff=80)

What I'm trying to do is:

  1. Read through the document string
  2. Parse for any fuzzy company name matches scoring say 80+
  3. Return company names that are contained in the document and their scores.

Help!

Jack McPherson
    Have you seen this thread? https://support.prodi.gy/t/fuzzy-partial-matching-with-phrasematcher-ner-task/1084/8 There are a few posts with GitHub links to working code combining these two libraries. – APhillips Jan 29 '20 at 02:10
  • Hey mate, I did see that one but I bounced off it. Am I missing something in how this can be used to solve this? – Jack McPherson Jan 29 '20 at 02:43

1 Answer


This is how I populated candidates -- mpg is a pandas DataFrame with the raw strings in its name column:

for s in mpg['name'].values: 
    doc = nlp(s) 
    for ent in doc.ents: 
        if ent.label_ == 'ORG': 
            candidates.append(ent.text) 
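
If you want to run that loop end to end, something like this could stand in for mpg (the column name and the sample values here are just placeholders for illustration):

import pandas as pd

# hypothetical stand-in for the mpg DataFrame used above
mpg = pd.DataFrame({'name': ['buick skylark 320',
                             'buick estate wagon (sw)',
                             'chevrolet chevelle malibu']})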

Then let's say we have a short list of car data just to test with:

candidates = ['buick',
              'buick skylark',
              'buick estate wagon',
              'buick century']

The method below uses fuzz.token_sort_ratio, which is described as "returning a measure of the sequences' similarity between 0 and 100, but sorting the tokens before comparing." Try some of the other scorers, which are partially documented here: https://github.com/seatgeek/fuzzywuzzy/issues/137

results = {}  # dictionary to store results
companies = ['buick']  # you'll have more companies
for company in companies:
    results[company] = process.extractBests(company, candidates,
                                            scorer=fuzz.token_sort_ratio,
                                            score_cutoff=50)

And the results are:

In [53]: results                                                                
Out[53]: {'buick': [('buick', 100), 
                    ('buick skylark', 56), 
                    ('buick century', 56)]}

In this case using 80 as a cutoff score would work better than 50.
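
Putting the pieces together for the original question (scan one document's text, keep the ORG entities, and fuzzy-match them against the full company list at a cutoff of 80), a minimal sketch could look like the following. Names like document_text and company_names are placeholders for your own variables, and fuzz.partial_ratio could be swapped in if you want 'buick' inside 'buick skylark' to count as a full match:

import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")

def match_companies(document_text, company_names, cutoff=80):
    # collect the ORG entities spaCy finds in this document as match candidates
    doc = nlp(document_text)
    candidates = [ent.text for ent in doc.ents if ent.label_ == "ORG"]

    # for each known company, keep the candidates scoring at or above the cutoff
    results = {}
    for company in company_names:
        matches = process.extractBests(company, candidates,
                                       scorer=fuzz.token_sort_ratio,
                                       score_cutoff=cutoff)
        if matches:
            results[company] = matches
    return results

Calling match_companies(strings[1], your_list_of_200_names) then returns a dict mapping each company name to the (matched text, score) pairs found in that document.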

mechanical_meat