I am just getting started with training spaCy named entity recognition models, following the basic example described here, where you create training examples by instantiating Doc objects and serializing them with DocBin.
My custom preprocess.py file looks like this:
import json
import sys

import spacy
from spacy.tokens import DocBin

# MyRecord is my own helper class, imported from elsewhere in the project

if __name__ == '__main__':
    nlp = spacy.blank("en")
    counter = 0
    db = DocBin()
    with open(sys.argv[1], 'r') as fp:
        line = fp.readline()
        while line:
            record = MyRecord.build(json.loads(line))
            doc = record.to_spacy_doc(nlp=nlp)
            # internally, something like:
            # # char-level indices
            # ent = doc.char_span(0, 5, label='SOMETHING')
            # doc.set_ents([ent])
            db.add(doc)
            counter += 1
            # hacky way to save 1000 docs in each DocBin
            if counter == 1000:
                db.to_disk("./train.spacy")
                db = DocBin()
            if counter == 2000:
                db.to_disk("./dev.spacy")
                break
            line = fp.readline()
I then run training with a command like this:
python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy
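For reference, at this point the [corpora] section of config.cfg is just the stock file-based reader that spacy init config generates (trimmed here to the relevant keys, so the exact values may differ from my actual config):

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}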
This seems to work well enough. However, I then read that you can write custom data loaders by writing and registering a generator that yields Example instances, in a process described here. This interests me because, in theory, you could read from larger-than-RAM files during the training loop.
I wrote such a generator in functions.py. It yields Example instances from an external data source (which happens to be custom JSONL on disk) where I have some known entity labels with char-level indices:
import json
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example

# MyRecord is the same helper class used in preprocess.py


@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language):
        counter = 0
        with open(source, 'r') as fp:
            line = fp.readline()
            while line:
                record = MyRecord.build(json.loads(line))
                #doc = nlp(record.text)
                doc = nlp.make_doc(record.doc_with_annotations.text)
                entities = [
                    (start, end, label)  # char-level offsets (not token-level)
                    for start, end, label, _
                    in record.get_entity_tuples()
                ]
                gold_dict = dict(entities=entities)
                example = Example.from_dict(doc, gold_dict)
                yield example
                counter += 1
                # arbitrarily stop at 20 for debugging purposes,
                # but ideally stream the whole (very large) file
                if counter > 20:
                    break
                line = fp.readline()

    return generate_stream
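The reader can also be exercised outside of training; a minimal sanity check (assuming functions.py is importable and train.jsonl is the same file referenced in the config below) would look like this:

# call the registered reader directly to inspect what the training loop will see
import spacy
from functions import stream_data

nlp = spacy.blank("en")
for example in stream_data("train.jsonl")(nlp):
    print(example.reference.ents)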
I also modified config.cfg to contain the following:
[corpora.dev]
@readers = "corpus_variants.v1"
source = "dev.jsonl"
[corpora.train]
@readers = "corpus_variants.v1"
source = "train.jsonl"
When I run the training command:
python -m spacy train config.cfg --output ./output --code functions.py
I get many UserWarning: [W030] Some entities could not be aligned in the text warnings. I've read a few posts on the theory behind these warnings, but I am curious why this behavior does not occur when I save the DocBins. Is the alignment actually different, or do the warnings only occur when explicitly creating Example instances?
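To make the question concrete, here is a small reproduction of the difference I think I'm seeing (the text and offsets are made up; my understanding is that Example.from_dict converts char offsets with offsets_to_biluo_tags internally, but that part is an assumption on my end):

import spacy
from spacy.training import Example, offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp.make_doc("Thisisatoken and some more text")
entities = [(0, 4, "SOMETHING")]  # char offsets that end mid-token

# DocBin path: char_span() simply returns None for a misaligned span,
# so the entity is silently dropped and no warning is raised
print(doc.char_span(0, 4, label="SOMETHING"))  # -> None

# Example path: the char offsets are converted to token-level BILUO tags;
# the misaligned token comes back as "-" and W030 is emitted
print(offsets_to_biluo_tags(doc, entities))  # ['-', 'O', 'O', 'O', 'O']
example = Example.from_dict(doc, {"entities": entities})  # warns W030 as well
print(example.reference.ents)  # -> ()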
I am interested in getting the custom data loader working with this data, but I'm also open to alternative approaches that would let me stream lines from a (larger-than-RAM) file as training examples.
Finally, it may also be relevant to understand the differences between training a fresh NER model from scratch and updating an existing one. If I understand spaCy pipelines correctly, there might be an alignment advantage to updating an existing model, since the same tokenizer can be used both when assembling training examples and at inference time.
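If it helps, a quick way to test that hypothesis would be to compare the tokenization of a blank pipeline against a pretrained one on the same text (assuming en_core_web_sm is installed; I haven't confirmed whether this actually matters for my data):

import spacy

text = "Some representative line from my JSONL data"

blank_nlp = spacy.blank("en")
pretrained_nlp = spacy.load("en_core_web_sm")

# if the two tokenizations differ, char offsets that align for one
# pipeline may not align for the other
print([t.text for t in blank_nlp.make_doc(text)])
print([t.text for t in pretrained_nlp.make_doc(text)])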