Train Spacy model with larger-than-RAM dataset

Question

I asked this question to better understand some of the nuances between training Spacy models with DocBins serialized to disk, versus loading Example instances via a custom data loading function. The goal was to train a Spacy NER model with more data than can fit into RAM (or at least to find some way to avoid loading the entire file into RAM). Though a custom data loader seemed like one specific way to accomplish this, I am writing this question to ask more generally:

How can one train a Spacy model without loading the entire training data set file during training?

score 2 · Accepted Answer · answered Dec 20 '21 at 05:36

Your only options are using a custom data loader or setting `max_epochs = -1`. See the docs.

polm23


Thanks, @polm23. The docs say: `-1 means stream train corpus [] rather than loading in memory with no shuffling within the training loop.` Would setting `max_epochs = -1` and using a `.spacy` file (or many `.spacy` files) stream the training data without a custom data loader? – user94154 Dec 20 '21 at 14:45
Yes, it will stream the data. – polm23 Dec 21 '21 at 03:01
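
For the custom data loader option mentioned in the answer, here is a minimal sketch of what a streaming reader might look like. It is not taken from the thread: the JSONL layout (one object per line with `text` and `entities` keys), the helper names `stream_jsonl` and `make_reader`, and the registry name `stream_jsonl.v1` are all assumptions chosen for illustration. The idea is only that a generator reads one line at a time, so the whole file never has to sit in memory.

```python
import json
from typing import Callable, Iterator, Tuple


def stream_jsonl(path: str) -> Iterator[Tuple[str, dict]]:
    """Lazily yield (text, annotations) pairs, one JSONL line at a time.

    Assumed (hypothetical) line format:
    {"text": "...", "entities": [[start, end, "LABEL"], ...]}
    """
    with open(path, encoding="utf8") as f:
        for line in f:  # file objects iterate lazily, line by line
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            entities = [tuple(ent) for ent in record["entities"]]
            yield record["text"], {"entities": entities}


def make_reader(path: str) -> Callable:
    """Build a reader callable of the shape spaCy expects: nlp -> Examples.

    To make it usable from a training config, it would be registered with
    the readers registry, e.g. by decorating this function with
    @spacy.registry.readers("stream_jsonl.v1")  (name is hypothetical).
    """

    def read(nlp):
        # Imported here so the streaming helper above works without spaCy.
        from spacy.training import Example

        for text, annotations in stream_jsonl(path):
            yield Example.from_dict(nlp.make_doc(text), annotations)

    return read
```

Because `read` is a generator, examples are produced only as the training loop asks for them. If you go the other route discussed above, no custom code like this should be needed: per the comments, setting `max_epochs = -1` with ordinary `.spacy` files streams the data as well.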
