
I am trying to find the vocabulary size of the large English model, i.e. en_core_web_lg, and I find three different sources of information:

  • spaCy's docs: 685k keys, 685k unique vectors

  • nlp.vocab.__len__(): 1340242 # (number of lexemes)

  • len(vocab.strings): 1476045

What is the difference between the three? I have not been able to find the answer in the docs.

– Yannis Ch

2 Answers


The most useful numbers are the ones related to word vectors: nlp.vocab.vectors.n_keys tells you how many tokens have word vectors, and len(nlp.vocab.vectors) tells you how many unique word vectors there are (multiple tokens can refer to the same word vector in md models).
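For example, you can inspect both numbers directly (a minimal sketch, assuming en_core_web_lg is installed; the exact counts depend on the model version):

    import spacy

    nlp = spacy.load("en_core_web_lg")

    # Tokens (keys) that have an entry in the vector table
    print(nlp.vocab.vectors.n_keys)

    # Unique vector rows; keys can share a row after pruning in md models
    print(len(nlp.vocab.vectors))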

len(vocab) is the number of cached lexemes. In md and lg models, most of those 1340242 lexemes have precalculated features (like Token.prob), but the cache can also contain lexemes without precalculated features, since new entries are added as you process texts.
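You can see the cache grow as you process text (a sketch under the same assumptions as above):

    import spacy

    nlp = spacy.load("en_core_web_lg")

    before = len(nlp.vocab)   # lexemes cached at load time
    nlp("A sentence with an unusual token like zxqvbl")
    after = len(nlp.vocab)    # new lexemes are cached during processing
    print(before, after)      # after >= before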

len(vocab.strings) is the number of strings related to both tokens and annotations (like nsubj or NOUN), so it's not a particularly useful number. All strings used anywhere in training or processing are stored here so that the internal integer hashes can be converted back to strings when needed.
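For example, the string store maps in both directions for token strings and annotation labels alike (a minimal sketch):

    import spacy

    nlp = spacy.load("en_core_web_lg")

    # Annotation labels live in the string store alongside token strings
    h = nlp.vocab.strings["nsubj"]   # string -> 64-bit hash
    print(h)
    print(nlp.vocab.strings[h])      # hash -> "nsubj"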

– aab
  • Thank you very much for your reply. Is there a way to determine which strings have a distinct word vector, and which map to the same vector? What is the default string/word vector that all out-of-vocabulary words map to? – Yannis Ch Jan 07 '20 at 07:54
  • Check out `Vectors.data` and `Vectors.key2row`: https://spacy.io/api/vectors#attributes. The default OOV is all 0s. – aab Jan 07 '20 at 15:19
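Following up on the comments above, one way to see which strings share a vector row is to invert Vectors.key2row, and the all-zeros default for out-of-vocabulary tokens is easy to check (a sketch, assuming en_core_web_md, whose pruned vector table shares rows between keys):

    import spacy
    from collections import defaultdict

    nlp = spacy.load("en_core_web_md")

    # Group vector-table keys by the row of Vectors.data they map to
    row2keys = defaultdict(list)
    for key, row in nlp.vocab.vectors.key2row.items():
        row2keys[row].append(nlp.vocab.strings[key])

    # Rows that more than one string maps to
    shared = {row: keys for row, keys in row2keys.items() if len(keys) > 1}
    print(len(shared), "rows are shared by multiple strings")

    # Out-of-vocabulary tokens map to the all-zeros vector
    doc = nlp("zxqvblword")
    print(doc[0].has_vector, doc[0].vector.sum())  # False 0.0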

Since spaCy v2.3, according to the release notes, the lexemes are no longer loaded into nlp.vocab on initialization, so len(nlp.vocab) is not a reliable measure. Instead, use nlp.meta['vectors'] to find the number of unique vectors and the number of words with vectors. Here is the relevant section from the release notes:

To reduce the initial loading time, the lexemes in nlp.vocab are no longer loaded on initialization for models with vectors. As you process texts, the lexemes will be added to the vocab automatically, just as in small models without vectors.

To see the number of unique vectors and the number of words with vectors, see nlp.meta['vectors']; for example, for en_core_web_md there are 20000 unique vectors and 684830 words with vectors:

    {
        'width': 300,
        'vectors': 20000,
        'keys': 684830,
        'name': 'en_core_web_md.vectors'
    }
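A quick check (a minimal sketch; the numbers depend on the installed model version):

    import spacy

    nlp = spacy.load("en_core_web_md")

    # spaCy v2.3+ reports the vector table's size in the model meta
    print(nlp.meta["vectors"])
    # -> {'width': 300, 'vectors': 20000, 'keys': 684830,
    #     'name': 'en_core_web_md.vectors'}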
– today