First, you don't have to know French to help me, as I will explain the grammar rules that I need to apply with spaCy in Python. I have a file (test.txt) with about 5000 French phrases, each different from the others, and a mail (textstr) that a client sends us, which is different each time. For each mail I have to check whether any of the phrases in the file appears in the mail. I thought of using spaCy's PhraseMatcher, but I have one problem: in each mail the verbs are conjugated, so I cannot use the PhraseMatcher's default behaviour (it matches on the verbatim token text and does not take verb conjugation into account). So I first thought of using the PhraseMatcher with lemmas, since all conjugated forms of a verb share the same lemma:

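import spacy
from spacy.matcher import PhraseMatcher

def treatemail(emailcontent):
    nlp = spacy.load("fr_core_news_sm")
    with open('test.txt', 'r', encoding="utf-8") as f:
        phrases_list = f.readlines()
    # match on lemmas rather than on the verbatim token text
    phrase_matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
    patterns = [nlp(phrase.strip()) for phrase in phrases_list]
    # spaCy v3 signature; v2 used add('phrases', None, *patterns)
    phrase_matcher.add('phrases', patterns)
    mail = nlp(emailcontent)
    matched_phrases = phrase_matcher(mail)
    for match_id, start, end in matched_phrases:
        # slice the parsed mail (the original sliced an undefined `sentence`)
        span = mail[start:end]
        print(span.text)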

This works for about 85% of the phrases in the file, but it fails for the remaining 15% because some French verbs take a reflexive pronoun (a pronoun that comes before the verb): me, te, se, nous, vous, se + verb, and the elided forms m', t' and s' + verb when the verb starts with a vowel. (They essentially always agree with the subject they refer to.)

In the text file the phrases are written in the infinitive, so if a phrase contains a reflexive pronoun, the pronoun is in its infinitive form too: either se + verb, or s' + verb when the verb starts with a vowel, e.g. "s'amuser" (to have fun), "se promener" (to take a walk). In the mail the verb is conjugated together with its reflexive pronoun, e.g. "Je me promène" (I take a walk).

What I need is essentially for the PhraseMatcher to take the reflexive pronouns into account. So here's my question: how can I do that? Should I write a custom component that checks whether there is a reflexive pronoun in the email and changes the text to its infinitive form, or is there some other way?

Thank you very much!

Laz22434

1 Answer

You can use dependency relations for this.

Pasting some example reflexive verb sentences into the displaCy demo, you can see that the reflexive pronouns for these verbs always have an expl:comp relation. A very simple way to find these verbs is to just iterate over tokens and check for that relation. (I am not 100% sure this is the only way it's used, so you should check that, but it seems likely.)
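As a rough sketch of that token loop: the function below only relies on the `dep_`, `head` and `lemma_` token attributes, so it should work on a parsed spaCy Doc. The `SimpleNamespace` tokens are a hypothetical hand-built stand-in for a parse of "Je me promène", used only so the sketch runs without downloading a model, and the `expl:comp` label is assumed from the displaCy output described above.

```python
from types import SimpleNamespace


def reflexive_verb_lemmas(doc):
    """Collect the lemma of every token that governs an expl:comp dependent."""
    return [token.head.lemma_ for token in doc if token.dep_ == "expl:comp"]


# Hypothetical stand-in for the parse of "Je me promène" (not real model output).
verb = SimpleNamespace(text="promène", lemma_="promener", dep_="ROOT")
verb.head = verb  # the root is its own head, as in spaCy
tokens = [
    SimpleNamespace(text="Je", lemma_="je", dep_="nsubj", head=verb),
    SimpleNamespace(text="me", lemma_="se", dep_="expl:comp", head=verb),
    verb,
]

print(reflexive_verb_lemmas(tokens))  # ['promener']
```

With a loaded pipeline you would pass the parsed email instead, e.g. `reflexive_verb_lemmas(nlp(emailcontent))`, and compare the result against the verbs from your file.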

I don't know French so I'm not sure if these verbs have strict ordering, or if words can come between the pronoun and the verb. If the latter (which seems likely), you can't use the normal Matcher or PhraseMatcher because they rely on contiguous sequences of words. But you can use the DependencyMatcher. Something like this:

from spacy.matcher import DependencyMatcher

VERBS = [ ... verbs in your file ... ]

pattern = [
  # anchor token: verb
  {
    "RIGHT_ID": "verb",
    "RIGHT_ATTRS": {"LEMMA": {"IN": VERBS}}
  },
  # has a reflexive pronoun
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "reflexive-pronoun",
    "RIGHT_ATTRS": {"DEP": "expl:comp"}
  }
]

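# nlp is the loaded pipeline and doc the parsed email, as in the question
matcher = DependencyMatcher(nlp.vocab)
matcher.add("REFLEXIVE", [pattern])
# each match is (match_id, token_ids), with token_ids in pattern order
matches = matcher(doc)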

This assumes that you only care about verb lemmas. If you care about the verb/pronoun combination you can just make a bunch of depmatcher rules or something.
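One possible way to fill in VERBS from the phrase file is to strip the leading se / s' from each infinitive. This is only a sketch under some assumptions: the helper names `reflexive_infinitive` and `build_reflexive_pattern` are made up for illustration, the verb is assumed to be the first word after the pronoun, and the model's lemma is assumed to equal the lowercased infinitive as written in the file.

```python
import re

# Leading reflexive marker of an infinitive: "se " or "s'" (straight or curly apostrophe).
_REFLEXIVE_PREFIX = re.compile(r"^(?:se\s+|s['’]\s*)", re.IGNORECASE)


def reflexive_infinitive(phrase):
    """Return the bare infinitive if the phrase starts with se / s', else None."""
    phrase = phrase.strip()
    match = _REFLEXIVE_PREFIX.match(phrase)
    if match is None:
        return None
    rest = phrase[match.end():].split()
    # Assumption: the verb is the first word after the pronoun.
    return rest[0].lower() if rest else None


def build_reflexive_pattern(phrases):
    """Build one DependencyMatcher pattern covering every reflexive verb found."""
    verbs = sorted({v for v in map(reflexive_infinitive, phrases) if v})
    return [
        # anchor token: one of the reflexive verbs
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": {"IN": verbs}}},
        # direct child carrying the reflexive-pronoun relation
        {
            "LEFT_ID": "verb",
            "REL_OP": ">",
            "RIGHT_ID": "reflexive-pronoun",
            "RIGHT_ATTRS": {"DEP": "expl:comp"},
        },
    ]


pattern = build_reflexive_pattern(["S'amuser", "se promener", "manger"])
print(pattern[0]["RIGHT_ATTRS"]["LEMMA"]["IN"])  # ['amuser', 'promener']
```

The resulting list can be passed to `matcher.add("REFLEXIVE", [pattern])` as above. Non-reflexive phrases such as "manger" are skipped here and could still go through the lemma-based PhraseMatcher from the question.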

polm23