It's not the raw size of your corpus, per se, that makes the model larger, but the number of unique words/doc-tags for which you want the model to train vectors.
If you're using 37 million unique documents, each with its own ID as its doc-tag, and you're using a common vector-size like 300 dimensions, those doc-vectors alone will require:
37 million * 300 dimensions * 4 bytes/dimension = 44.4 GB
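A quick back-of-the-envelope check of that figure, assuming the usual 4-byte float32 storage per dimension:

    # rough doc-vector memory estimate, assuming float32 (4 bytes per dimension)
    n_docs = 37_000_000
    vector_size = 300
    bytes_per_dim = 4
    print(n_docs * vector_size * bytes_per_dim / 1e9)  # -> 44.4 (GB)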
More RAM will be required for the unique words and internal model weights, but with a normal-size vocabulary and a reasonable choice of min_count to discard rarer words, not nearly as much as these doc-vectors.
Gensim supports streamed training that doesn't require more memory for a larger corpus, but if you want to end up with 37 million 300-dimensional doc-vectors in the same model, that amount of addressable memory will still be required.
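For example, a minimal streamed-training sketch might look like the following, where the corpus file name and the one-tokenized-document-per-line format are illustrative assumptions:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    class StreamedCorpus:
        """Yields one TaggedDocument per line, without loading the whole file."""
        def __init__(self, path):
            self.path = path
        def __iter__(self):
            with open(self.path, encoding='utf-8') as f:
                for doc_id, line in enumerate(f):
                    # each doc gets a single unique integer tag
                    yield TaggedDocument(words=line.split(), tags=[doc_id])

    corpus = StreamedCorpus('docs.txt')   # hypothetical pre-tokenized file
    model = Doc2Vec(corpus, vector_size=300, min_count=5, epochs=10, workers=4)

The iterable is re-read for the vocabulary scan and each training epoch, so the corpus itself never needs to fit in RAM – only the model's vectors do.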
Your best bet might be to train a model on some smaller, representative subset – perhaps just a random subset – that fits in addressable memory. Then, when you need vectors for other docs, you could use infer_vector() to calculate them one at a time, and store them somewhere else. (But then you still wouldn't have them all in memory at once, which can be crucial for adequately fast most_similar() scans or other full-corpus comparisons.)
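A rough sketch of that subset-then-infer approach, where iter_subset_docs(), iter_remaining_docs(), and the output file name are all hypothetical placeholders:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # train only on a representative sample that fits in addressable memory
    subset = [TaggedDocument(words=tokens, tags=[doc_id])
              for doc_id, tokens in iter_subset_docs()]
    model = Doc2Vec(subset, vector_size=300, min_count=5, epochs=10, workers=4)

    # infer vectors for the remaining docs one at a time, appending each to a
    # disk-backed file instead of keeping them all in RAM
    with open('other_doc_vectors.f32', 'wb') as out:
        for doc_id, tokens in iter_remaining_docs():
            vec = model.infer_vector(tokens)   # float32 ndarray, vector_size dims
            out.write(np.asarray(vec, dtype=np.float32).tobytes())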
Using a machine with tons of RAM makes working with such large vector-sets much easier.
(One other possible trick involves the use of the mapfile_path
parameter – but unless you're familiar with how your operating system handles memory-mapped files, and understand how the big docvecs array is further used/transformed by your later operations, it may be more trouble than it's worth. It also involves a performance hit, which will likely only be tolerable if each doc has a single unique ID tag, so that the pattern of access to the mmapped file – in both training and similarity searches – is a simple front-to-back read in the same original order. You can see this answer for more details.)
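If you do want to experiment with that trick, here's a heavily-hedged sketch. It assumes a gensim 3.x-style Doc2Vec constructor that exposes a docvecs_mapfile argument; newer releases route memory-mapping through KeyedVectors' mapfile_path instead, so check your version's docs:

    from gensim.models.doc2vec import Doc2Vec

    # assumption: gensim 3.x-style `docvecs_mapfile`; the big doc-vector array is
    # then backed by a file on disk rather than held entirely in RAM
    model = Doc2Vec(
        corpus,                          # streamed iterable of TaggedDocument, as above
        vector_size=300,
        min_count=5,
        epochs=10,
        docvecs_mapfile='docvecs.mmap',  # path for the memory-mapped doc-vector file
    )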