I am trying to use spaCy to create a new entity category 'Species' from a list of species names; an example can be found here.
I found a tutorial for training a new entity type in this spaCy tutorial (GitHub code here). However, the problem is that I don't want to manually create a sentence for each species name, as that would be very time-consuming.
The training data I created looks like this:
TRAIN_DATA = [('Bombina',{'entities':[(0,6,'SPECIES')]}),
('Dermaptera',{'entities':[(0,9,'SPECIES')]}),
....
]
The way I created the training set is: instead of providing a full sentence and the location of the matched entity within it, I provide only the name of each species, and the start and end indices are generated programmatically:
[( 0, 6, 'SPECIES' )]
[( 0, 9, 'SPECIES' )]
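The generation step can be sketched like this (a minimal sketch; `make_example` is a hypothetical helper, and note that spaCy character offsets are end-exclusive, so the end index should be `len(name)`):

```python
# Minimal sketch: build a spaCy v2-style training example from a bare species name.
# spaCy character offsets are end-exclusive, so the end index is len(name).
def make_example(name, label='SPECIES'):
    return (name, {'entities': [(0, len(name), label)]})

print(make_example('Bombina'))  # → ('Bombina', {'entities': [(0, 7, 'SPECIES')]})
```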
The training code below is what I used to train the model (copied from the hyperlink above):
import random
import spacy

LABEL = 'SPECIES'  # the new entity label
model = None       # set to an existing model name to update it instead
n_iter = 30

nlp = spacy.blank('en')  # create blank Language class
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
    ner = nlp.get_pipe('ner')
ner.add_label(LABEL)  # add new entity label to entity recognizer
if model is None:
    optimizer = nlp.begin_training()
else:
    # Note that 'begin_training' initializes the models, so it'll zero out
    # existing entity types.
    optimizer = nlp.entity.create_optimizer()
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print(losses)
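Since each training text here is just the species name itself, one sanity check I can run before training is that every annotated span slices out the whole string (a minimal sketch, no spaCy required; `check_spans` is a hypothetical helper):

```python
# Minimal sketch: for keyword-only examples, the annotated span should cover
# the full name. spaCy offsets are end-exclusive, so the end must be len(text).
def check_spans(train_data):
    bad = []
    for text, ann in train_data:
        for start, end, label in ann['entities']:
            if text[start:end] != text:
                bad.append((text, start, end))
    return bad

print(check_spans([('Bombina', {'entities': [(0, 6, 'SPECIES')]})]))
# → [('Bombina', 0, 6)]   ('Bombina'[0:6] is only 'Bombin')
```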
I'm new to NLP and spaCy, so please let me know whether I did this correctly, and why my attempt failed during training (when I run it, it throws an error).
[UPDATE]
The reason I want to feed only keywords to the model is that, ideally, I hope the model will learn those keywords first; then, once it identifies a context containing a keyword, it will learn the associated context and thereby enhance the current model.
At first glance, this looks more like a regular expression. But with more and more data fed in, the model would continuously learn, and finally be able to identify new species names that did not exist in the original training set.
Thanks, Katie