
I am working on an NLP project where I have to use spaCy and the spaCy Matcher to extract every named entity (NE) that is an nsubj (subject), together with the verb it depends on, i.e. the governor verb of the NE subject. Example:

Georges and his friends live in Mexico City
"Hello !", says Mary

I need to extract "Georges" and "live" from the first sentence and "Mary" and "says" from the second, but I don't know how many words will sit between the named entity and the verb it relates to, so I decided to explore the spaCy Matcher further. I'm struggling to write a Matcher pattern that extracts these two words. When the NE subject comes before the verb, I get good results, but I don't know how to write a pattern that matches an NE subject that comes after the verb it depends on. According to the guidelines, I could also do this task with "regular spaCy", but I don't know how. The problem with the Matcher is that I can't constrain the dependency relation between the NE and the verb, so I can't reliably grab the right verb. I'm new to spaCy; I've always worked with NLTK or Jieba (for Chinese). I don't even know how to split a text into sentences with spaCy, but I chose to split the whole text into sentences to avoid bad matches spanning two sentences. Here is my code:

import spacy
from nltk import sent_tokenize
from spacy.matcher import Matcher

nlp = spacy.load('fr_core_news_md')

matcher = Matcher(nlp.vocab)

def get_entities_verbs():

    try:

        # subject before verb
        pattern_subj_verb = [{'ENT_TYPE': 'PER', 'DEP': 'nsubj'}, {"POS": {'NOT_IN':['VERB']}, "DEP": {'NOT_IN':['nsubj']}, 'OP':'*'}, {'POS':'VERB'}]
        # subject after verb
        # (no working pattern for this case yet)

        matcher.add('ent-verb', [pattern_subj_verb])

        # read the file once and close it before processing
        with open('Le_Ventre_de_Paris-short.txt', encoding='utf-8') as f:
            text = f.read()
        for sent in sent_tokenize(text):
            sent = nlp(sent)
            matches = matcher(sent)
            for match_id, start, end in matches:
                span = sent[start:end]
                print(span)

    except Exception as error:
        print(error)


def main():

    get_entities_verbs()

if __name__ == '__main__':
    main()

Even though the text is in French, I can assure you that I get good results:

Florent regardait
Lacaille reparut
Florent baissait
Claude regardait
Florent resta
Florent, soulagé
Claude s’était arrêté
Claude en riait
Saget est matinale, dit
Florent allait
Murillo peignait
Florent accablé
Claude entra
Claude l’appelait
Florent regardait
Florent but son verre de punch ; il le sentit
Alexandre, dit
Florent levait
Claude était ravi
Claude et Florent revinrent
Claude, les mains dans les poches, sifflant

Some results are wrong, but about 90% are good. I just need to grab the first and last word of each line to get my NE/verb pair. So my question is: how do I extract an NE that is a subject, together with the verb it depends on, using the Matcher, or simply with spaCy (without the Matcher)? There are too many factors to take into account. Do you have a method that gets the best possible results, even if 100% is not achievable? I need a pattern matching a governor VERB followed by its NE subject, starting from this pattern:

pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

All credit to polm23 for this pattern

Etienne Armangau

1 Answer


This is a perfect use case for the Dependency Matcher. It also makes things easier if you merge entities to single tokens before running it. This code should do what you need:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

# merge entities to simplify this
nlp.add_pipe("merge_entities")


pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

texts = [
        "John Smith and some other guy live there",
        '"Hello!", says Mary.',
        ]

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)

    for match in matches:
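        # each match is (match_id, token_ids); token_ids holds one token index per pattern node
        match_id, (person_i, verb_i) = match
        # note order here is defined by the pattern, so the nsubj will be first
        print(doc[person_i], "::", doc[verb_i])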
    print()

Check out the docs for the DependencyMatcher.
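
For the "regular spaCy" route the question mentions, a minimal sketch without any matcher is to walk the tokens and follow `head`. The helper name `subject_verb_pairs` and the `demo` function are hypothetical, and the stand-in tokens exist only so the logic can be run without downloading a model. With a real `Doc`, the same attributes (`dep_`, `ent_type_`, `pos_`, `head`) are read from spaCy tokens. This assumes the parser is in the pipeline; since spaCy then derives sentence boundaries from the parse, a subject's head should sit in the same sentence, so splitting the text beforehand should not be needed.

```python
from types import SimpleNamespace


def subject_verb_pairs(doc, ent_label="PER"):
    """Yield (subject, verb) pairs for entity subjects governed by a verb.

    `doc` is any iterable of spaCy-like tokens exposing `dep_`,
    `ent_type_`, `pos_` and `head` (a parsed spaCy Doc qualifies).
    """
    for tok in doc:
        if (
            tok.dep_ == "nsubj"
            and tok.ent_type_ == ent_label
            and tok.head.pos_ == "VERB"
        ):
            yield tok, tok.head


def demo(model="fr_core_news_md", text='"Bonjour !", dit Marie.'):
    """Hypothetical usage; needs spaCy and the named model installed."""
    import spacy

    nlp = spacy.load(model)
    doc = nlp(text)
    return [(subj.text, verb.text) for subj, verb in subject_verb_pairs(doc)]


# Stand-in tokens (an assumption for illustration, not spaCy objects)
# mimicking the parse of: "Hello!", says Mary.
says = SimpleNamespace(text="says", pos_="VERB", dep_="ROOT", ent_type_="")
says.head = says
mary = SimpleNamespace(
    text="Mary", pos_="PROPN", dep_="nsubj", ent_type_="PERSON", head=says
)
fake_doc = [says, mary]

pairs = [(s.text, v.text) for s, v in subject_verb_pairs(fake_doc, ent_label="PERSON")]
print(pairs)
```

Because the check follows the dependency arc rather than word order, it picks up the subject whether it comes before or after the verb. Without `merge_entities`, only the token carrying the `nsubj` label is returned, not the full multi-word name.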

polm23
  • It's the module I wanted, but unfortunately your code doesn't work in French. I did load the right model beforehand, for sure. I'll need to look into it, because I can't find any match. – Etienne Armangau Apr 26 '21 at 09:20
  • Hm, that's odd, it should be basically the same in French. Looking at your sample code, maybe you need to change PERSON in my code to PER? – polm23 Apr 26 '21 at 10:11
  • As an experiment, maybe remove the ENT_TYPE restriction so you get all subjects, and then add it back. – polm23 Apr 26 '21 at 10:12
  • It works like a charm. ENT_TYPE is PER in French. I checked a couple of minutes ago. My brain wasn't working. – Etienne Armangau Apr 26 '21 at 10:26
  • How do I match verb + NE? I'm struggling. Once I have the right pattern, I'll add an if test on the match id and print the words in the right order. At least I can do that... – Etienne Armangau Apr 26 '21 at 11:40
  • Any idea? I've been struggling since this morning. – Etienne Armangau Apr 26 '21 at 17:42
  • This pattern is already matching verb + NE, what is the different thing you want to match? Do you mean NE as object? – polm23 Apr 27 '21 at 02:18
  • No, an NE subject that comes after the governor verb. How can I know whether the NE will be before or after the verb, so that I can print the sequence in the right order? – Etienne Armangau Apr 28 '21 at 02:04
  • Check the token indices (`tok.i`). The lower number comes first in the document. – polm23 Apr 28 '21 at 03:03
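
The `tok.i` suggestion in the last comment can be sketched as follows. The helper names `in_document_order` and `print_matches` are hypothetical, the token indices in the stand-in example are illustrative, and `print_matches` assumes a DependencyMatcher pattern whose first node is the subject and whose second node is the verb, as in the answer's pattern.

```python
from types import SimpleNamespace


def in_document_order(subject, verb):
    """Return (first, second, subject_first) based on the tokens' `.i`."""
    subject_first = subject.i < verb.i
    first, second = (subject, verb) if subject_first else (verb, subject)
    return first, second, subject_first


def print_matches(doc, matches):
    """Print each DependencyMatcher match in surface order.

    Assumes every match is (match_id, [subject_index, verb_index]).
    """
    for _match_id, (subj_i, verb_i) in matches:
        first, second, subject_first = in_document_order(doc[subj_i], doc[verb_i])
        label = "subject-verb" if subject_first else "verb-subject"
        print(label, "::", first.text, second.text)


# Stand-in tokens (illustrative indices) for: "Hello!", says Mary.
says = SimpleNamespace(text="says", i=5)
mary = SimpleNamespace(text="Mary", i=6)

first, second, subject_first = in_document_order(mary, says)
print(first.text, second.text, subject_first)  # says Mary False
```

The same pattern therefore covers both word orders; only the printing step needs to look at the indices.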