
I have a corpus of over 37M sentences/documents, and I'm training a Gensim Doc2Vec model on it. Training works fine with smaller datasets of, say, 5M-10M records, but on the full dataset the process usually dies at the "resetting layer weights" stage, and sometimes earlier.

I suspect it's a memory issue: I have 16GB of RAM and 4 cores. If it is indeed a memory issue, is there any way to train the model in batches? From reading the documentation, train() seems intended for cases where the new documents don't introduce new vocabulary, but that's not the case with my documents.

Any suggestions?

N.Hamoud

1 Answer


It's not the raw size of your corpus, per se, that makes the model larger, but the number of unique words/doc-tags the model must learn.

If you have 37 million unique documents, each with its own ID as its doc-tag, and a common vector-size like 300 dimensions, the doc-vectors alone will require:

37 million * 300 dimensions * 4 bytes/dimension = 44.4 GB
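
That figure is easy to reproduce; a minimal sketch of the arithmetic (the counts come from the question, and 4 bytes/dimension reflects gensim's float32 vectors):

```python
n_docs = 37_000_000    # unique doc-tags, one per document
vector_size = 300      # dimensions per doc-vector
bytes_per_dim = 4      # float32, gensim's default dtype

doc_vector_bytes = n_docs * vector_size * bytes_per_dim
print(f"{doc_vector_bytes / 1e9:.1f} GB")  # -> 44.4 GB
```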

More RAM will be needed for the unique words and internal model weights, but with a normal-size vocabulary and a reasonable choice of min_count to discard rarer words, not nearly as much as those doc-vectors.

Gensim supports streamed training that doesn't require more memory for a larger corpus, but if you want to end up with 37 million 300-dimensional doc-vectors in the same model, that amount of addressable memory will still be required.
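
For example, a streamed corpus can be as simple as the sketch below (the file path, one-tokenized-document-per-line format, and min_count value are illustrative assumptions, not anything from your question):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class StreamedCorpus:
    """Re-iterable corpus: yields one TaggedDocument at a time, so the
    full corpus never has to sit in RAM at once."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for doc_no, line in enumerate(f):
                # assumes one whitespace-tokenizable document per line
                yield TaggedDocument(words=line.split(), tags=[doc_no])

# Doc2Vec iterates over the corpus multiple times, so it must be a
# re-iterable object (a class like this), not a one-shot generator
model = Doc2Vec(StreamedCorpus('docs.txt'),
                vector_size=300, min_count=5, workers=4)
```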

Your best bet might be to train a model on some smaller, representative subset – perhaps just a random subset – that fits in addressable memory. Then, when you need vectors for other docs, you could use infer_vector() to calculate them one at a time and store them somewhere else. (But then you wouldn't have them all in memory, which can be crucial for adequately-fast most_similar() scans or other full-corpus comparisons.)
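
A rough sketch of that subset-then-infer pattern (the corpus stand-ins, subset size, and the dict used for storage are all hypothetical; a real pipeline would stream the corpus and persist the inferred vectors to disk or a database):

```python
import random
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical stand-in for the real corpus: (doc_id, token_list) pairs
all_docs = [(i, ['token', 'list', 'for', 'doc', str(i)])
            for i in range(10_000)]

# Train only on a random, memory-fitting subset
train_pairs = random.sample(all_docs, 1_000)
train_ids = {doc_id for doc_id, _ in train_pairs}
subset = [TaggedDocument(words=tokens, tags=[doc_id])
          for doc_id, tokens in train_pairs]
model = Doc2Vec(subset, vector_size=300, min_count=2, workers=4)

# Infer vectors one at a time for docs outside the subset, storing
# them outside the model
other_vectors = {doc_id: model.infer_vector(tokens)
                 for doc_id, tokens in all_docs
                 if doc_id not in train_ids}
```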

Using a machine with tons of RAM makes working with such large vector-sets much easier.

(One other possible trick involves the mapfile_path parameter – but unless you're familiar with how your operating system handles memory-mapped files, and understand how the big docvecs array is further used/transformed in your later operations, it may be more trouble than it's worth. It also involves a performance hit, which will likely only be tolerable if your docs each have a single unique ID tag, so that access to the mmapped file, in both training and similarity-searches, is always a simple front-to-back read in the same original order. See this answer for more details.)
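
In the gensim 3.x line, that mapping is requested through the docvecs_mapfile argument to the Doc2Vec constructor (check the docs for your exact version; the path below is a made-up example, and StreamedCorpus is the sketch class from above):

```python
from gensim.models.doc2vec import Doc2Vec

# docvecs_mapfile (gensim 3.x) backs the big doc-vectors array with a
# memory-mapped file at the given (hypothetical) path instead of RAM
model = Doc2Vec(StreamedCorpus('docs.txt'),
                vector_size=300, min_count=5, workers=4,
                docvecs_mapfile='/mnt/ssd/docvecs.mmap')
```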

gojomo
  • Thanks for your feedback, I really appreciate it. I did stream the documents, which I believe only saves the memory and time of loading and preprocessing the whole corpus up front; as you mentioned, the memory required to build/train the model is still the same, depending on the vector size and word counts. I wonder why the process is killed even though the SSD has over 250GB free. We tried building the model on a 4-core machine, and it was killed when memory use reached about 85GB. Is there an OS threshold that triggers the kill, and can we do anything about it? – N.Hamoud Jul 31 '18 at 18:59
  • Another question, regarding the vector size: is 300D good enough, and is there a rule of thumb for choosing the vector size? Training the model on a smaller dataset is not ideal for my project, as my data is already a small subset of the original dataset. – N.Hamoud Jul 31 '18 at 19:00
  • The model needs addressable memory - essentially RAM. (Using virtual memory, even backed by SSD, is quite bad and slow for these models, given the amount of random access required. Still, the `mapfile_path` trick I mentioned can make use of swapped-in virtual memory, and the slower performance *might* be tolerable during training as long as you only ever progress through the doctags in original order.) – gojomo Aug 01 '18 at 00:34
  • Because virtual memory is so much slower than RAM, your OS likely has some limits on its use, and it probably won't use more than some small multiple of RAM size as virtual memory. Getting around that would be an OS-specific question... but again, if you're using *any* virtual memory during training, you'll probably have awful training speed, and will wind up with a giant model that's not so good for post-training lookups either. – gojomo Aug 01 '18 at 00:36