I'm trying to train a Named Entity Recognition (NER) model for custom tags using spaCy version 3. I went through the documentation on their website, but I cannot figure out the proper way to create the pipeline model. Apparently, if I use en_core_web_trf, I'm unable to add my own tags: the final output scores are all zeroes. But the same approach works correctly with en_core_web_sm.
However, if I use a makeshift method, creating a blank English model and then manually adding the transformer component from en_core_web_trf and the ner component from en_core_web_sm, it works.
My question is: is there a better way to initialize the model and the pipeline than this makeshift method? I do not care about pre-trained entities like LOCATION, etc. I just want to train the model (using a transformer-based approach) on the custom entities defined in my dataset.
import spacy

def load_spacy():
    spacy.require_gpu()

    # 1) 'Makeshift' method: blank English pipeline with the transformer
    #    sourced from en_core_web_trf and the ner component from en_core_web_sm
    source_nlp = spacy.load("en_core_web_sm")
    source_nlp_trf = spacy.load("en_core_web_trf")
    nlp = spacy.blank("en")
    nlp.add_pipe("transformer", source=source_nlp_trf)
    nlp.add_pipe("ner", source=source_nlp)

    # 2) trf-only method (this overwrites the pipeline built above;
    #    only one of the two methods is active at a time)
    nlp = spacy.load("en_core_web_trf")

    # Getting the ner pipeline component
    ner = nlp.get_pipe("ner")
    return ner, nlp
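After loading, I register my custom labels on the returned ner component before training. A minimal sketch of what I mean (the label names below are placeholders for the actual tags in my dataset):

ner, nlp = load_spacy()

# Placeholder labels standing in for my custom entity tags
for label in ("DEVICE", "FIRMWARE_VERSION"):
    ner.add_label(label)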
Edit: The exact training methodology I used is described in this Python script, in the fit() function of the NerModel class.
The load_spacy() in that script (line no. 16) uses the small model, but I was experimenting with the transformer model and used the definition of load_spacy() given at the beginning of this question.
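Roughly, the training follows the standard spaCy v3 update pattern; a simplified sketch with a single placeholder example (the real fit() iterates over my annotated dataset):

import random
from spacy.training import Example

# Toy annotation: characters 28-36 cover "MyDevice" (placeholder data)
TRAIN_DATA = [
    ("The sensor firmware runs on MyDevice hardware.",
     {"entities": [(28, 36, "DEVICE")]}),
]

optimizer = nlp.resume_training()  # keep the sourced transformer/ner weights
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)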
PS: I have been experimenting on Google Colab (i.e., in a notebook) to make use of the GPU for the transformer, but the source code and methodology are almost the same.