I can see that in the English spaCy models, the medium model performs better than the small one, and the large model outperforms the medium one, but only marginally. However, the model descriptions say that they have all been trained on OntoNotes; the exception is the vectors of the md and lg models, which were trained on Common Crawl. So if all models were trained on the same dataset (OntoNotes) and the only difference is the vectors, why is there a performance difference for tasks that don't require vectors? I would love to find out more about each model and the settings they were trained with, but it appears that this information is not readily available.
1 Answer
> So if all models were trained on the same dataset (OntoNotes), and the only difference is the vectors, why then is there a performance difference for the tasks that don't require vectors?
I think the missing piece you're looking for is this: if models are initialised with vectors, those vectors will be used as features during training. Depending on the vectors, this can give the statistical model components you train a significant boost in accuracy.
However, vectors can be quite large, so you typically want to find the best trade-off between model size and accuracy. If vectors were used during training, the same vectors also need to be available at runtime, and you can't easily swap them out – otherwise, the model will perform much worse. The `sm` model, which wasn't trained with vectors, allows you to load in your own vectors for, say, similarity comparisons, without affecting the predictions of the pre-trained statistical components.
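For example, here's a minimal sketch of what that looks like in practice, assuming `en_core_web_sm` is installed. The vectors here are toy random ones purely to illustrate the API; in practice you'd load real pre-trained vectors:

```python
import numpy as np
import spacy

# The sm model ships without a vector table, so .similarity() has to
# fall back on other representations (spaCy warns about this).
nlp = spacy.load("en_core_web_sm")
print(nlp.vocab.vectors.shape)  # (0, 0) - no vectors included

# Because the sm model's statistical components were trained *without*
# vectors, attaching your own doesn't degrade their predictions.
# Toy 300-dimensional vectors, just to illustrate Vocab.set_vector:
rng = np.random.default_rng(0)
vec = rng.standard_normal(300).astype("float32")
nlp.vocab.set_vector("dog", vec)
nlp.vocab.set_vector("puppy", vec + 0.01)  # nearly identical vector

# Similarity is now driven by the vectors we just registered.
print(nlp("dog").similarity(nlp("puppy")))  # close to 1.0
```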
TL;DR: spaCy's `sm`, `md` and `lg` core models were all trained on the same data under the same conditions. The only difference is the vectors that are included, which are used as features and thus have an impact on the model's accuracy.

- I see. So a token's feature vector helps to identify, for instance, its PoS tag? I can't seem to find this in the source code. It is a bit odd, though, that the performance difference is so small, right? I am not entirely sure which parts of spaCy are statistical and which ones are neural, though. Do you have any sources on this? – Bram Vanroy Sep 11 '19 at 09:41
- Yes, the word vector gives you a more useful representation of a word than just its text/prefix/suffix/shape. [This video](https://www.youtube.com/watch?v=sqDHBH9IjRU) explains the features in more detail. The impact depends on how relevant the vectors are. It's not always that significant, because word vectors are limiting and don't really reflect the "context". That's why transfer learning and embeddings like BERT are so exciting. And there's no "neural vs. statistical" distinction: the components that predict linguistic features are statistical models, implemented using neural networks. – Ines Montani Sep 11 '19 at 13:22
- Thank you for this explanation. Could you expand the answer to include some information about which situations one might want to choose one model over the other? As far as I understand, it's a matter of making trade-offs between size, performance, and more "granular" text analysis, for lack of a better term. But I struggle to understand which models are "best" for which purposes. – leifericf May 25 '22 at 07:56