Using spaCy's rule-based pattern matcher with the '+' operator, I get back not only the longest possible span but also every shorter span contained within it. I'm wondering if there's a way to return only the longest spans.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

doc = nlp("I have a BA in English Literature. I received a certificate in Computational Linguistics. I have a Computer Science BA.")

# spaCy v2 signature: add(key, on_match, *patterns); None means no callback
matcher.add("education", None,
            [{'TAG': 'NN'}, {'POS': 'ADP'}, {'POS': 'PROPN', 'OP': '+'}],
            [{'POS': 'PROPN', 'OP': '+'}, {'POS': 'NOUN'}])

matches = matcher(doc)

for match_id, start, end in matches:
    # Slice the doc by token offsets to get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

The output is:

BA in English
BA in English Literature
certificate in Computational
certificate in Computational Linguistics
Science BA
Computer Science BA

Is there a simple way to get it to return just the "greediest" spans (e.g. "BA in English Literature", "certificate in Computational Linguistics", and "Computer Science BA")?

Will
  • It [is said](https://spacy.io/usage/rule-based-matching#quantifiers) that this behavior is a bug and has been fixed in spaCy v2.1.0. – Wiktor Stribiżew Sep 01 '19 at 20:26
  • That is a related issue: there was a bug that the greediness was not always applied consistently. But the output given by OP (all matches) is actually the intended behaviour. – Sofie VL Apr 21 '20 at 09:10

1 Answer

The expected behaviour of the Matcher is to return all possible matches, and I don't think there is currently a way to get only the "greediest" ones. You'll have to filter the matches by length yourself.
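
Filtering by length could look something like the sketch below. It works on plain (match_id, start, end) tuples, so it doesn't depend on the spaCy version. The helper name keep_longest is made up for illustration, and the token offsets in the usage example are hand-written assumptions about how the question's text is tokenized, not real matcher output.

```python
def keep_longest(matches):
    """Keep the longest non-overlapping (match_id, start, end) tuples.

    `keep_longest` is a hypothetical helper, not part of spaCy.
    """
    # Longest matches first; ties are broken by the earliest start offset.
    by_length = sorted(matches, key=lambda m: (m[1] - m[2], m[1]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for match_id, start, end in by_length:
        if taken.isdisjoint(range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# Assumed offsets for the six spans printed in the question (match_id set to 0).
example_matches = [
    (0, 3, 6), (0, 3, 7),      # "BA in English", "BA in English Literature"
    (0, 11, 14), (0, 11, 15),  # "certificate in Computational (Linguistics)"
    (0, 20, 22), (0, 19, 22),  # "Science BA", "Computer Science BA"
]
longest_matches = keep_longest(example_matches)
print(longest_matches)
```

With real matcher output you would pass matcher(doc) in place of example_matches. If you already have Span objects, spacy.util.filter_spans applies a similar longest-first rule, so a hand-rolled helper may not be needed.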

UPDATE: since spaCy v3.0, you can call matcher.add with a greedy argument set to "FIRST" or "LONGEST": https://spacy.io/api/matcher#add
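
A minimal sketch of the v3 approach, assuming spaCy v3.0 or later is installed. To keep it independent of a trained model, the Doc is built by hand with POS and tag annotations; those annotations are assumptions standing in for what a tagger would produce, and the text is a shortened version of the question's example.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated tokens (an assumption, standing in for a tagger's output).
words = ["I", "have", "a", "BA", "in", "English", "Literature", ".",
         "I", "have", "a", "Computer", "Science", "BA", "."]
pos = ["PRON", "VERB", "DET", "NOUN", "ADP", "PROPN", "PROPN", "PUNCT",
       "PRON", "VERB", "DET", "PROPN", "PROPN", "NOUN", "PUNCT"]
tags = ["PRP", "VBP", "DT", "NN", "IN", "NNP", "NNP", ".",
        "PRP", "VBP", "DT", "NNP", "NNP", "NN", "."]
doc = Doc(Vocab(), words=words, pos=pos, tags=tags)

matcher = Matcher(doc.vocab)
patterns = [
    [{"TAG": "NN"}, {"POS": "ADP"}, {"POS": "PROPN", "OP": "+"}],
    [{"POS": "PROPN", "OP": "+"}, {"POS": "NOUN"}],
]
# v3 signature: patterns are passed as one list; greedy filters overlapping matches.
matcher.add("education", patterns, greedy="LONGEST")

# Sort by start offset so the spans come out in document order.
spans = sorted(matcher(doc, as_spans=True), key=lambda span: span.start)
longest = [span.text for span in spans]
print(longest)
```

On this hand-built Doc, only "BA in English Literature" and "Computer Science BA" should remain. The explicit sort is there because the filtered matches are not necessarily returned in document order.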

Sofie VL