I am trying to use spaCy's rule-based patterns to match the different surface forms of person names in my text: LASTNAME, FIRSTNAME and/or FIRSTNAME, LASTNAME and/or FIRSTNAME LASTNAME (no punctuation).

I tried this:

import spacy

# create a nlp object with pretrained model
nlp = spacy.load('fr_core_news_lg')
ruler = nlp.add_pipe("entity_ruler", after='ner') 

# define the patterns
patterns = [{"label":"PER", "pattern":[{"LOWER":"jules"},{"LOWER":"michelet"}]},
            # each dict matches exactly one token, so "jean" and "joseph" need separate entries
            {"label":"PER", "pattern": [{"LOWER":"jean"},{"LOWER":"joseph"},{"LOWER":"laborde"}]},
            ]

specific_forms_patterns_persons = [
    {"label": "PER", "pattern": [{"ENT_TYPE": "PER"}, {"IS_PUNCT": True}, {"ENT_TYPE": "PER"}]}
    ]

# add patterns to ruler
ruler.add_patterns(patterns)
ruler.add_patterns(specific_forms_patterns_persons)

# convert the input sentence into the document object using the 'nlp'
doc = nlp("Jules, Michelet avec Laborde, Jean Joseph et Jacques Mei à Paris.")

# print the entities found in the sentence after adding the EntityRuler
print([(ent.text, ent.label_) for ent in doc.ents])

I get this output:

[('Jules', 'PER'), ('Michelet', 'PER'), ('Laborde', 'PER'), ('Jean Joseph', 'PER'), ('Jacques Mei', 'PER'), ('Paris', 'LOC')]

But I want to get:

[('Jules, Michelet', 'PER'), ('Laborde, Jean Joseph', 'PER'), ('Jacques Mei', 'PER'), ('Paris', 'LOC')]

I tried to customize my pattern with:

specific_forms_patterns_persons = [
    {"label": "PER", "pattern": [{"ENT_TYPE": "PER"}, {"ORTH": ","}, {"ENT_TYPE": "PER"}]}
    ]

but I still get the same output. It would probably be best to train a spaCy model to recognize these specific forms, but I was wondering whether this is possible with rules alone.

Zoe
Lter

1 Answer

You need to set overwrite_ents in the EntityRuler config; otherwise the ruler won't change entities that the NER component has already set.

cfg = {"overwrite_ents": True}
ruler = nlp.add_pipe("entity_ruler", after='ner', config=cfg)

You should also look at matching more than one PER token using an OP quantifier (such as "OP": "+") to handle the "Laborde, Jean Joseph" case, where the first name spans two tokens.

Note that while you are working on your patterns, I would recommend avoiding existing labels like PER, as reusing them makes it hard to tell a pre-existing annotation from one of yours. Until you finalize the patterns, use MYPER or something similar.
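
Putting the last two suggestions together, the pattern from the question might look roughly like this. It is only a sketch: the "+" quantifiers and the temporary MYPER label are my assumptions about what fits your data, not something tested against fr_core_news_lg.

```python
# Sketch only: the "MYPER" label and the "+" quantifiers are illustrative
# assumptions, not patterns verified against a specific model.
specific_forms_patterns_persons = [
    {
        "label": "MYPER",  # temporary label, distinct from the model's own PER
        "pattern": [
            {"ENT_TYPE": "PER", "OP": "+"},  # one or more tokens already tagged PER
            {"ORTH": ","},                   # a literal comma
            {"ENT_TYPE": "PER", "OP": "+"},  # one or more PER tokens after the comma
        ],
    }
]
```

You would pass this list to ruler.add_patterns(...) on a ruler created with overwrite_ents set to True, as above. When several matches overlap, the EntityRuler keeps the one covering the most tokens, so "Laborde, Jean Joseph" should win over "Laborde, Jean", provided the NER step tagged every name part as PER.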

polm23