I'm trying to create noun chunks using the spaCy pattern matcher. For example, given the sentence "The ice hockey scrimmage took hours.", I want to return "ice hockey scrimmage" and "hours". I currently have this:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# One NOUN followed by any number of further NOUNs.
# spaCy v2 signature; in v3 this call is matcher.add("NounChunks", [pattern]).
matcher.add("NounChunks", None, [{"POS": "NOUN"}, {"POS": "NOUN", "OP": "*"}, {"POS": "NOUN", "OP": "*"}])

doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # hash -> "NounChunks"
    span = doc[start:end]  # token span covered by the match
    print(match_id, string_id, start, end, span.text)
But it returns every sub-span of "ice hockey scrimmage", not just the longest one:
12482938965902279598 NounChunks 1 2 ice
12482938965902279598 NounChunks 1 3 ice hockey
12482938965902279598 NounChunks 2 3 hockey
12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 2 4 hockey scrimmage
12482938965902279598 NounChunks 3 4 scrimmage
12482938965902279598 NounChunks 5 6 hours
Is there something I'm missing in how to define the pattern? I want it to return only:
12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 5 6 hours
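For reference, one way to get from the first output to the second is to post-filter the matches instead of changing the pattern. The sketch below is only an illustration under that assumption: `keep_longest` is a hypothetical helper (not part of spaCy) that works on plain `(match_id, start, end)` tuples and keeps the longest non-overlapping ones. spaCy ships a similar utility for `Span` objects, `spacy.util.filter_spans`, and in spaCy v3 `Matcher.add` also accepts `greedy="LONGEST"`; whether either is available depends on the installed version.

```python
def keep_longest(matches):
    """Keep only the longest non-overlapping (match_id, start, end) matches.

    Hypothetical helper for illustration; it is not a spaCy function.
    """
    # Consider longer spans first; among equal lengths, the earlier start wins.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    used_tokens = set()
    for match_id, start, end in ordered:
        # Skip any span that shares a token with a span already kept.
        if any(i in used_tokens for i in range(start, end)):
            continue
        kept.append((match_id, start, end))
        used_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The (match_id, start, end) tuples from the output listed in the question.
NOUN_CHUNKS = 12482938965902279598
matches = [
    (NOUN_CHUNKS, 1, 2), (NOUN_CHUNKS, 1, 3), (NOUN_CHUNKS, 2, 3),
    (NOUN_CHUNKS, 1, 4), (NOUN_CHUNKS, 2, 4), (NOUN_CHUNKS, 3, 4),
    (NOUN_CHUNKS, 5, 6),
]

for match_id, start, end in keep_longest(matches):
    print(match_id, start, end)
```

With the real matcher, the same helper would be applied as `keep_longest(matcher(doc))` before the printing loop.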