
The following link shows how to add a custom entity rule where the entities span more than one token. The code to do that is below:

import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', parse=True, tag=True, entity=True)

animal = ["cat", "dog", "artic fox"]
ruler = EntityRuler(nlp)
for a in animal:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
nlp.add_pipe(ruler)


doc = nlp("There is no cat in the house and no artic fox in the basement")

with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])


from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern =[{'lower': 'no'},{'ENT_TYPE': {'REGEX': 'animal', 'OP': '+'}}]
matcher.add('negated animal', None, pattern)
matches = matcher(doc)


for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

I tried it, but I got the error below:

  • If you created your component with nlp.create_pipe('name'): remove nlp.create_pipe and call nlp.add_pipe('name') instead.

  • If you passed in a component like TextCategorizer(): call nlp.add_pipe with the string name instead, e.g. nlp.add_pipe('textcat').

  • If you're using a custom component: Add the decorator @Language.component (for function components) or @Language.factory (for class components / factories) to your custom component and assign it a name, e.g. @Language.component('your_name'). You can then run nlp.add_pipe('your_name') to add it to the pipeline.

How can I fix this, please? NB: spaCy version 3.0.6

Learner
  • As a note, you got this error because the question you refer to was for spaCy 2, but you're using spaCy 3. Also the error message you copy pasted here tells you how to fix it, did you try following the instructions? – polm23 Jun 10 '21 at 03:39

3 Answers


For spaCy v2, the normal way to add an entity ruler looked like this:

ruler = EntityRuler(nlp)
nlp.add_pipe(ruler)
ruler.add_patterns(...)

For spaCy v3, you just want to add it with its string name and skip instantiating the class separately:

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(...)

See: https://spacy.io/usage/v3#migrating-add-pipe
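Applied to the patterns from the question, the v3 form can be sketched as follows. This is a minimal sketch, not a definitive setup: it assumes a blank English pipeline (`spacy.blank("en")`) instead of `en_core_web_sm`, so that it runs without a downloaded model, and the variable name `entities` is only illustrative.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm here,
# so the sketch does not depend on a downloaded model.
nlp = spacy.blank("en")

# spaCy v3: add the built-in component by its string name.
# add_pipe returns the component, so patterns can be added afterwards.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [{"label": "animal", "pattern": name} for name in ["cat", "dog", "artic fox"]]
)

doc = nlp("There is no cat in the house and no artic fox in the basement")

# Collect the entities set by the ruler, including the two-token one.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, the same two lines (`nlp.add_pipe("entity_ruler")` followed by `ruler.add_patterns(...)`) replace the `EntityRuler(nlp)` and `nlp.add_pipe(ruler)` calls from the question.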

aab

You need to define your own method to instantiate the entity ruler:

def get_ent_ruler(nlp, name):
    ruler = EntityRuler(nlp)
    for a in animal:
        ruler.add_patterns([{"label": "animal", "pattern": a}])
    return ruler

Then, you may use it the following way:

from spacy.language import Language
Language.factory("ent_ruler", func=get_ent_ruler)
nlp.add_pipe("ent_ruler", last=True)

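Also, note that the pattern you wrote is not valid: 'OP' is nested inside the 'ENT_TYPE' dictionary, but it is a token-level key that belongs next to 'ENT_TYPE'. I think you can fix it like this: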

pattern =[{'lower': 'no'},{'ENT_TYPE': 'animal'}]

Then, the result is

no cat
no artic fox
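
For reference, the whole flow can be sketched end to end for spaCy v3. This is a sketch under stated assumptions: it uses a blank English pipeline and the built-in `entity_ruler` factory rather than the custom factory above, and the match key `negated_animal` and the variable `negated` are illustrative names. Note that in v3 `Matcher.add` takes a list of patterns as its second argument, so the `None` from the question is dropped.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank English pipeline with the built-in entity ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [{"label": "animal", "pattern": name} for name in ["cat", "dog", "artic fox"]]
)

doc = nlp("There is no cat in the house and no artic fox in the basement")

# Merge each entity into a single token so that one ENT_TYPE token
# in the pattern covers multi-token entities such as "artic fox".
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "no"}, {"ENT_TYPE": "animal"}]
# spaCy v3 signature: add(key, list_of_patterns)
matcher.add("negated_animal", [pattern])

negated = [doc[start:end].text for _, start, end in matcher(doc)]
print(negated)
```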
Wiktor Stribiżew
  • This works in a short script, but causes a lot of unnecessary headaches if you want to save and reload the model. If you're using a built-in component like `entity_ruler` that already has a factory, it's better to just use that factory name with `nlp.add_pipe("entity_ruler")`. – aab Jun 10 '21 at 07:40
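
The save-and-reload case mentioned in the comment can be illustrated with a round trip through a temporary directory. This is only a sketch assuming a blank English pipeline and the built-in `entity_ruler` factory; the names `reloaded` and `reloaded_entities` are illustrative.

```python
import tempfile

import spacy

# Assumption: blank English pipeline with the built-in entity ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "animal", "pattern": "artic fox"}])

# Save the pipeline and load it back. Because "entity_ruler" is a
# built-in factory, no extra registration code is needed at load time.
with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    reloaded = spacy.load(tmp_dir)

doc = reloaded("There is no artic fox in the basement")
reloaded_entities = [(ent.text, ent.label_) for ent in doc.ents]
print(reloaded.pipe_names, reloaded_entities)
```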

For spaCy 3.0+, your code should be changed as follows:

import spacy
import re
from spacy.language import Language

nlp = spacy.load('en_core_web_sm')
boundary = re.compile('^[0-9]$')

@Language.component("component")
def custom_seg(doc):
    prev = doc[0].text
    length = len(doc)
    for index, token in enumerate(doc):
        if (token.text == '.' and boundary.match(prev) and index!=(length - 1)):
            doc[index+1].is_sent_start = False
        prev = token.text
    return doc
    
nlp.add_pipe("component", before='parser')
Park
  • text = u'This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!' doc = nlp(text) for sentence in doc.sents: print(sentence.text) – Zhiwei Yang Feb 04 '22 at 09:11