I am building a Named Entity Recognition model for biomedical text (cancer papers from PubMed). I trained a custom NER model using spaCy for three entity types (DISEASE, GENE, and DRUG). I then combined the model with rule-based components to improve its accuracy.

Here is my current code -


import spacy
from spacy.pipeline import EntityRuler

# Load the trained NER model
nlp = spacy.load("my_spacy_model")

# Define entity patterns for EntityRuler (just showing 2 relevant patterns here, it contains more patterns)
patterns = [{"label": "GENE", "pattern": "BRCA1"},
            {"label": "GENE", "pattern": "BRCA2"}]

ruler = EntityRuler(nlp)

ruler.add_patterns(patterns)

nlp.add_pipe(ruler)

When I test the above code on the following piece of text -

text = "Exceptional response to olaparib in BRCA2-altered breast cancer after PD-L1 inhibitor and chemotherapy failure"

I get the following result -

DISEASE  BRCA2-altered breast cancer
DRUG  olaparib
GENE PD-L1

However, the correct answer is -

GENE BRCA2
^^^^^^^^^^^
DISEASE breast cancer
^^^^^^^^^^^^^^^^^^^^^
DRUG  olaparib
GENE PD-L1

The model is not recognizing BRCA2 as a gene, even though I added it to the patterns for the EntityRuler.

Is there a way to prioritize predictions from rule-based matching over the trained model? Alternatively, is there something else I can do to get the correct results by combining rule-based matching?

iCHAIT

1 Answer

You can either add the EntityRuler before the NER component in the pipeline:

nlp.add_pipe(ruler, before="ner")

Or tell the EntityRuler to overwrite existing entities:

ruler = EntityRuler(nlp, overwrite_ents=True)

The NER predictions might be slightly different in each case, because in the first option, the model's predictions might change given the presence of existing entity spans.
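
To make the second option more concrete, here is a minimal pure-Python sketch of the overlap rule that `overwrite_ents` controls. It is a simplified model, not spaCy's actual implementation: the `Span` tuple, the `apply_ruler` helper, the token indices, and the "breast cancer" ruler match are all hypothetical, and the sketch assumes the ruler's own matches do not overlap each other.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label); indices are illustrative only
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two token ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def apply_ruler(existing: List[Span], matches: List[Span],
                overwrite_ents: bool) -> List[Span]:
    """Merge rule matches into existing entities (simplified model)."""
    if overwrite_ents:
        # Drop every existing entity that a rule match touches, even partially.
        kept = [e for e in existing if not any(overlaps(e, m) for m in matches)]
        added = list(matches)
    else:
        # Keep existing entities; skip rule matches that would collide with them.
        kept = list(existing)
        added = [m for m in matches if not any(overlaps(m, e) for e in existing)]
    return sorted(kept + added)


# Hypothetical model output: DISEASE covers "BRCA2-altered breast cancer".
model_ents = [(3, 4, "DRUG"), (5, 8, "DISEASE"), (9, 10, "GENE")]
# Hypothetical rule match for "breast cancer" only (assumed pattern).
ruler_matches = [(6, 8, "DISEASE")]

with_overwrite = apply_ruler(model_ents, ruler_matches, overwrite_ents=True)
without_overwrite = apply_ruler(model_ents, ruler_matches, overwrite_ents=False)
print(with_overwrite)
print(without_overwrite)
```

Note that in this model a rule can only replace or add spans it actually matches. If no pattern matches a token such as `BRCA2-altered`, neither setting produces a `BRCA2` entity.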

aab
  • I already tried adding the EntityRuler before the NER component. That helps for the particular case I shared in my question; however, the updated pipeline no longer tags entities that it was able to identify earlier. It picks up the rule-based entities that I supply but misses many entities that it was tagging previously. How can I overcome this issue? – iCHAIT Aug 29 '19 at 06:55
  • I tried using `overwrite_ents = True` and it gave me the following results: DISEASE breast cancer, DRUG olaparib, GENE PD-L1. It is not able to recognize `BRCA2` as a gene. I think that is because in the given sentence `BRCA2-altered` is a single word and I don't have a rule for that. Can you explain a bit about what `overwrite_ents = True` is doing? I read the relevant [doc](https://spacy.io/api/entityruler#init), but couldn't understand what it is doing. – iCHAIT Aug 29 '19 at 07:22
  • Then I think `overwrite_ents = True` is the better option for you. Ines gives a good explanation here: https://github.com/explosion/spaCy/issues/3775 – aab Aug 29 '19 at 07:24
  • If any part of the EntityRuler entity overlaps with an existing entity (even partially), the EntityRuler removes the existing entity and adds the new EntityRuler one. – aab Aug 29 '19 at 07:26
  • Thanks for the link and the explanation. I have one follow-up question though: using `overwrite_ents = True` gives me the following results: DISEASE breast cancer, DRUG olaparib, GENE PD-L1. It is not able to recognize `BRCA2` as a gene. I think that is because in the given text `BRCA2-altered` acts as one word, and since I don't have that in my patterns, it does not get tagged. How can I work around that? – iCHAIT Aug 29 '19 at 07:28
  • I obviously can't add `BRCA2-altered` to my patterns list since that is not a valid GENE. How can I extract and tag BRCA2 here? – iCHAIT Aug 29 '19 at 07:36
  • To do this cleanly in spaCy you would have to change the tokenization. Changing the tokenizer to split on every `-` might cause headaches elsewhere, though. Here's how to customize infixes in the tokenizer: https://stackoverflow.com/a/57304882/461847 – aab Aug 29 '19 at 12:58
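
The trade-off in the last comment can be illustrated without spaCy. The sketch below uses only Python's `re` module and a whitespace split as a stand-in tokenizer, so it is not the spaCy tokenizer. The `NARROW_INFIX` rule (split a hyphen only between a digit and a lowercase letter) is a hypothetical pattern chosen for this example; something like it could be added to the tokenizer's infix patterns as described in the linked answer, and it would need checking against your own data.

```python
import re
from typing import List, Pattern

# Hypothetical narrow rule: hyphen between a digit and a lowercase letter.
NARROW_INFIX = re.compile(r"(?<=[0-9])-(?=[a-z])")
# Broad rule: split on every hyphen.
BROAD_INFIX = re.compile(r"-")


def split_on_infix(word: str, infix: Pattern[str]) -> List[str]:
    """Split one whitespace-delimited word at each infix match, keeping the infix."""
    pieces = []
    last = 0
    for match in infix.finditer(word):
        if match.start() > last:
            pieces.append(word[last:match.start()])
        pieces.append(match.group())
        last = match.end()
    if last < len(word):
        pieces.append(word[last:])
    return pieces


def tokenize(text: str, infix: Pattern[str]) -> List[str]:
    """Whitespace tokenization followed by infix splitting (stand-in for a real tokenizer)."""
    tokens = []
    for word in text.split():
        tokens.extend(split_on_infix(word, infix))
    return tokens


text = "BRCA2-altered breast cancer after PD-L1 inhibitor"
narrow_tokens = tokenize(text, NARROW_INFIX)
broad_tokens = tokenize(text, BROAD_INFIX)
print(narrow_tokens)
print(broad_tokens)
```

With the narrow rule, `BRCA2` becomes its own token, so an exact-match pattern for it can apply, while `PD-L1` stays in one piece. The broad rule also breaks `PD-L1` apart, which is the kind of side effect the comment warns about.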