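As of spaCy 2.3, according to the release notes, the lexemes are no longer loaded into nlp.vocab when a model with vectors is initialized, so len(nlp.vocab) only counts the lexemes seen so far and does not reflect the size of the model's vocabulary. Instead, use nlp.meta['vectors'] to find the number of unique vectors and the number of words with vectors. Here is the relevant section from the release notes: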
To reduce the initial loading time, the lexemes in nlp.vocab are no longer loaded on initialization for models with vectors. As you process texts, the lexemes will be added to the vocab automatically, just as in small models without vectors.
To see the number of unique vectors and number of words with vectors, see nlp.meta['vectors'], for example for en_core_web_md there are 20000 unique vectors and 684830 words with vectors:
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
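
As a minimal sketch of how those two numbers could be read in practice, the snippet below pulls them out of a meta dictionary. The helper names vector_counts and load_vector_counts are hypothetical (they are not part of spaCy), and the example dictionary simply reuses the values quoted above instead of loading a real model. load_vector_counts assumes spaCy and the en_core_web_md package are installed.

```python
def vector_counts(meta):
    """Return (unique_vectors, words_with_vectors) from a pipeline meta dict.

    Hypothetical helper: it only reads the 'vectors' entry described in the
    release notes and falls back to 0 when an entry is missing.
    """
    vectors = meta.get("vectors", {})
    return vectors.get("vectors", 0), vectors.get("keys", 0)


def load_vector_counts(model_name="en_core_web_md"):
    """Load a pipeline and report its vector counts (needs spaCy + the model)."""
    # Imported lazily so vector_counts() can be used without spaCy installed.
    import spacy

    nlp = spacy.load(model_name)
    return vector_counts(nlp.meta)


# Stand-in for nlp.meta, using the figures quoted from the release notes.
example_meta = {
    "vectors": {
        "width": 300,
        "vectors": 20000,
        "keys": 684830,
        "name": "en_core_web_md.vectors",
    }
}

unique_vectors, words_with_vectors = vector_counts(example_meta)
print(f"{unique_vectors} unique vectors, {words_with_vectors} words with vectors")
```

With a model installed, load_vector_counts("en_core_web_md") should return the same kind of pair straight from nlp.meta. The exact numbers depend on the model version, so they may differ from the figures in the release notes.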