I am building a Named Entity Recognition model for biomedical text (cancer papers from PubMed). I trained a custom NER model using spaCy for 3 entity types (DISEASE, GENE, and DRUG). I then combined the model with rule-based components to improve its accuracy.
Here is my current code -
import spacy
from spacy.pipeline import EntityRuler

# Load the trained NER model
nlp = spacy.load("my_spacy_model")
# Define entity patterns for the EntityRuler (only 2 relevant patterns shown here; the full list contains more)
patterns = [{"label": "GENE", "pattern": "BRCA1"},
            {"label": "GENE", "pattern": "BRCA2"}]
ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
# Added with the default placement, i.e. at the end of the pipeline
nlp.add_pipe(ruler)
When I test the above code on the following piece of text -
text = "Exceptional response to olaparib in BRCA2-altered breast cancer after PD-L1 inhibitor and chemotherapy failure"
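For reference, output in the "LABEL text" form shown in this post can be produced by looping over `doc.ents`. This is a minimal sketch, not the exact code I ran: the helper name `format_entities` is hypothetical, and it only assumes that each entity exposes the `label_` and `text` attributes, as spaCy spans do.

```python
def format_entities(doc):
    """Return one 'LABEL text' line per entity found in a spaCy Doc."""
    # doc.ents holds the final entity spans after all pipeline components have run
    return [f"{ent.label_} {ent.text}" for ent in doc.ents]


# Usage with the pipeline and text from this post (assumes `nlp` and `text` exist):
# doc = nlp(text)
# print("\n".join(format_entities(doc)))
```

Because `doc.ents` cannot contain overlapping spans, this listing only ever shows one label per stretch of text.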
I get the following result -
DISEASE BRCA2-altered breast cancer
DRUG olaparib
GENE PD-L1
However, the correct answer is -
GENE BRCA2
^^^^^^^^^^^
DISEASE breast cancer
^^^^^^^^^^^^^^^^^^^^^
DRUG olaparib
GENE PD-L1
The model is not recognizing BRCA2 as a gene, even though I have added it to the patterns for the EntityRuler.
Is there a way to prioritize predictions from rule-based matching over those of the trained model? Alternatively, is there something else I can do to get the correct results by combining rule-based matching with the trained model?