I am just getting started with training spaCy named entity recognition models, following the basic example described here, where you create training examples by instantiating Doc objects and serializing them with DocBin.
My custom preprocess.py file looks like this:
import json
import sys

import spacy
from spacy.tokens import DocBin

# MyRecord is my own helper class, imported from elsewhere in the project

if __name__ == '__main__':
    nlp = spacy.blank("en")
    counter = 0
    db = DocBin()
    with open(sys.argv[1], 'r') as fp:
        line = fp.readline()
        while line:
            record = MyRecord.build(json.loads(line))
            doc = record.to_spacy_doc(nlp=nlp)
            # internally, something like:
            # # char-level indices
            # ent = doc.char_span(0, 5, label='SOMETHING')
            # doc.set_ents([ent])
            db.add(doc)
            counter += 1
            # hacky way to save 1000 docs in each DocBin
            if counter == 1000:
                db.to_disk("./train.spacy")
                db = DocBin()
            if counter == 2000:
                db.to_disk("./dev.spacy")
                break
            line = fp.readline()
I then run training with a command like this:
python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy
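For reference, at this point the [corpora] section of config.cfg is just the stock file-based reader that spacy init config generates (trimmed here to the relevant keys, so the exact values may differ from my actual config):

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}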
This seems to work well enough. However, I then read that you can write custom data loaders by writing and registering a generator that yields Example instances, in a process described here. This interests me because, in theory, you could read from larger-than-RAM files during the training loop.
I wrote such a generator in functions.py. It yields Example instances from an external data source (which happens to be custom JSONL on disk) where I have some known entity labels with char-level indices:
import json
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example

# MyRecord is the same helper class used in preprocess.py


@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language):
        counter = 0
        with open(source, 'r') as fp:
            line = fp.readline()
            while line:
                record = MyRecord.build(json.loads(line))
                #doc = nlp(record.text)
                doc = nlp.make_doc(record.doc_with_annotations.text)
                entities = [
                    (start, end, label)  # char-level offsets (not token-level)
                    for start, end, label, _
                    in record.get_entity_tuples()
                ]
                gold_dict = dict(entities=entities)
                example = Example.from_dict(doc, gold_dict)
                yield example
                counter += 1
                # arbitrarily stop at 20 for debugging purposes,
                # but ideally stream the whole (very large) file
                if counter > 20:
                    break
                line = fp.readline()

    return generate_stream
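The reader can also be exercised outside of training; a minimal sanity check (assuming functions.py is importable and train.jsonl is the same file referenced in the config below) would look like this:

# call the registered reader directly to inspect what the training loop will see
import spacy
from functions import stream_data

nlp = spacy.blank("en")
for example in stream_data("train.jsonl")(nlp):
    print(example.reference.ents)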
I also modified config.cfg to contain the following:
[corpora.dev]
@readers = "corpus_variants.v1"
source = "dev.jsonl"
[corpora.train]
@readers = "corpus_variants.v1"
source = "train.jsonl"
When I run the training command:
python -m spacy train config.cfg --output ./output --code functions.py
I get many UserWarning: [W030] Some entities could not be aligned in the text warnings. I've read a few posts on the theory behind these warnings, but I am curious why this behavior does not occur when I save the DocBins. Is the alignment actually different, or do the warnings only occur when explicitly creating Example instances?
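To make the question concrete, here is a small reproduction of the difference I think I'm seeing (the text and offsets are made up; my understanding is that Example.from_dict converts char offsets with offsets_to_biluo_tags internally, but that part is an assumption on my end):

import spacy
from spacy.training import Example, offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp.make_doc("Thisisatoken and some more text")
entities = [(0, 4, "SOMETHING")]  # char offsets that end mid-token

# DocBin path: char_span() simply returns None for a misaligned span,
# so the entity is silently dropped and no warning is raised
print(doc.char_span(0, 4, label="SOMETHING"))  # -> None

# Example path: the char offsets are converted to token-level BILUO tags;
# the misaligned token comes back as "-" and W030 is emitted
print(offsets_to_biluo_tags(doc, entities))  # ['-', 'O', 'O', 'O', 'O']
example = Example.from_dict(doc, {"entities": entities})  # warns W030 as well
print(example.reference.ents)  # -> ()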
I am interested in getting the custom data loader working with this data, but I'm also open to alternative approaches that would let me stream lines from a (larger-than-RAM) file as training examples.
Finally, it may also be relevant to understand the differences between training a fresh NER model from scratch and updating an existing one. If I understand spaCy pipelines correctly, there might be an alignment advantage to updating an existing model, since the same tokenizer can be used both when assembling training examples and at inference time.
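If it helps, a quick way to test that hypothesis would be to compare the tokenization of a blank pipeline against a pretrained one on the same text (assuming en_core_web_sm is installed; I haven't confirmed whether this actually matters for my data):

import spacy

text = "Some representative line from my JSONL data"

blank_nlp = spacy.blank("en")
pretrained_nlp = spacy.load("en_core_web_sm")

# if the two tokenizations differ, char offsets that align for one
# pipeline may not align for the other
print([t.text for t in blank_nlp.make_doc(text)])
print([t.text for t in pretrained_nlp.make_doc(text)])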