Spacy Matcher - Only Match Longest String

Question

I'm trying to create noun chunks using the spaCy pattern matcher. For example, given the sentence "The ice hockey scrimmage took hours.", I want to return "ice hockey scrimmage" and "hours". I currently have this:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
matcher.add("NounChunks", None,  [{"POS": "NOUN"}, {"POS": "NOUN", "OP": "*"}, {"POS": "NOUN", "OP": "*"}] )

doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id] 
    span = doc[start:end]  
    print(match_id, string_id, start, end, span.text)

But it is returning all versions of "ice hockey scrimmage" and not just the longest.

12482938965902279598 NounChunks 1 2 ice
12482938965902279598 NounChunks 1 3 ice hockey
12482938965902279598 NounChunks 2 3 hockey
12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 2 4 hockey scrimmage
12482938965902279598 NounChunks 3 4 scrimmage
12482938965902279598 NounChunks 5 6 hours

Is there something I'm missing in how to define the pattern? I want it to return only:

12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 5 6 hours

Raqib · Accepted Answer · 2020-07-08T19:14:26.990

I don't know of a built-in way to keep only the longest span, but there is a utility function, spacy.util.filter_spans(spans), that helps with this. It picks the longest span among the given spans, and if multiple overlapping spans have the same length, the one that starts earlier takes priority.

import spacy 

from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
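# spaCy v2 signature; in spaCy v3 this becomes matcher.add("NounChunks", [[{"POS": "NOUN", "OP": "+"}]])
matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])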

doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)

spans = [doc[start:end] for _, start, end in matches]
print(spacy.util.filter_spans(spans))

Output

[ice hockey scrimmage, hours]
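
To make the selection rule a bit more concrete, here is a rough pure-Python sketch of the same idea, working on plain (start, end) tuples instead of Span objects. The name filter_longest is made up for this illustration and is not part of spaCy; it is only an approximation of what filter_spans does, not its actual implementation.

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longest first; among equal lengths, the span that starts earlier wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    used = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if not used.intersection(range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)  # back to document order


# (start, end) pairs taken from the matcher output in the question
matches = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4), (5, 6)]
print(filter_longest(matches))
```

This prints [(1, 4), (5, 6)], which corresponds to the same two spans as above. Newer spaCy releases (v3 and later) also accept a greedy="LONGEST" argument in Matcher.add, which does this kind of filtering inside the matcher itself.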
