I have a list of company names that I'm trying to find in a large number of PDF documents.
I've forced the PDFs through Apache Tika to extract the raw text, and I've got the list of 200 companies read in.
I'm stuck trying to use some combination of FuzzyWuzzy and spaCy to extract the required matches.
This is as far as I've gotten:
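import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")
doc = nlp(strings[1])  # strings holds the raw text Tika extracted from the PDFs

companies = []
candidates = []

# Collect every entity that spaCy labels as an organisation
for ent in doc.ents:
    if ent.label_ == "ORG":
        candidates.append(ent.text)

# company_name is meant to be a single name from my list of 200 companies
process.extractBests(company_name, candidates, score_cutoff=80)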
What I'm trying to do is:
- Read through the document string
- Find any fuzzy matches against the company names that score, say, 80 or higher
- Return company names that are contained in the document and their scores.
Help!
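
For reference, here is a minimal sketch of the matching step described above, assuming the ORG entities have already been collected into a plain list of strings. It loops over every known company rather than a single `company_name`, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer instead of FuzzyWuzzy, so the scores will not necessarily agree with the `fuzz` scorers. The helper names `similarity` and `match_companies` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return int(round(100 * ratio))


def match_companies(candidates, companies, cutoff=80):
    """Map each known company to its best-scoring candidate at or above cutoff."""
    found = {}
    for company in companies:
        best_text, best_score = None, -1
        for text in candidates:
            score = similarity(company, text)
            if score > best_score:
                best_text, best_score = text, score
        if best_score >= cutoff:
            found[company] = (best_text, best_score)
    return found


# Hypothetical inputs: in practice `candidates` would be the ORG entities
# collected from doc.ents, and `companies` the list of 200 known names.
candidates = ["Acme Corp.", "Globex Corporation", "the Treasury"]
companies = ["Acme Corp", "Globex Corporation", "Initech"]

for company, (text, score) in match_companies(candidates, companies).items():
    print(company, "->", text, score)
```

Companies with no candidate at or above the cutoff are simply left out of the result, so the returned dictionary holds only the names found in the document together with their scores. A limitation of this approach is that it can only match names spaCy has already tagged as ORG, so anything the entity recogniser misses will never be scored.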