Using spaCy's rule-based pattern matcher with the '+' operator, I get back not only the longest possible span but also every shorter span within it. I'm wondering if there's a way to return only the longest spans.
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp("I have a BA in English Literature. I received a certificate in Computational Linguistics. I have a Computer Science BA.")
matcher.add("education", None,
[{'TAG': 'NN'}, {'POS': 'ADP'}, {'POS': 'PROPN', 'OP': '+'}],
[{'POS': 'PROPN', 'OP': '+'}, {"POS": "NOUN"}])
matches = matcher(doc)
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
The output is:
BA in English
BA in English Literature
certificate in Computational
certificate in Computational Linguistics
Science BA
Computer Science BA
Is there a simple way to get it to return just the "greediest" spans (e.g. "BA in English Literature", "certificate in Computational Linguistics", and "Computer Science BA")?
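One possible approach, sketched here as a plain-Python post-processing step, is to filter the matcher's (match_id, start, end) tuples so that only the longest non-overlapping ones survive. The helper name keep_longest and the sample token offsets are hypothetical, chosen to mirror the output shown above; this is a minimal sketch, not necessarily how spaCy handles it internally.

```python
def keep_longest(matches):
    """Keep only the longest non-overlapping (match_id, start, end) tuples.

    Hypothetical helper: `matches` is assumed to have the same shape as
    the list returned by calling a spaCy Matcher on a Doc.
    """
    # Consider the longest matches first; break ties by earliest start.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    used_tokens = set()
    for match_id, start, end in ordered:
        token_range = range(start, end)
        # Skip any match that shares a token with a longer match already kept.
        if any(i in used_tokens for i in token_range):
            continue
        kept.append((match_id, start, end))
        used_tokens.update(token_range)
    # Restore document order.
    return sorted(kept, key=lambda m: m[1])


# Sample offsets assumed to correspond to the six spans printed above.
sample_matches = [
    (1, 3, 6), (1, 3, 7),      # "BA in English", "BA in English Literature"
    (1, 11, 14), (1, 11, 15),  # "certificate in Computational (Linguistics)"
    (1, 20, 22), (1, 19, 22),  # "Science BA", "Computer Science BA"
]
longest = keep_longest(sample_matches)
```

In the loop above, this would be used as `for match_id, start, end in keep_longest(matches):`. Depending on the installed spaCy version, a built-in may also be available: spacy.util.filter_spans accepts a list of Span objects (e.g. `[doc[start:end] for _, start, end in matches]`) and prefers the longest span when spans overlap, and in spaCy v3 Matcher.add accepts a greedy="LONGEST" argument.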