
I extracted 145,185,965 sentences (14GB) from the English Wikipedia dump and I want to train a Doc2Vec model on these sentences. Unfortunately I have 'only' 32GB of RAM and get a MemoryError when trying to train. Even if I set min_count to 50, gensim tells me that it would need over 150GB of RAM. I don't think that further increasing min_count is a good idea, because the resulting model would not be very good (just a guess). But anyway, I will try it with 500 to see whether memory is sufficient then.

Are there any possibilities to train such a large model with limited RAM?

Here is my current code:

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# one TaggedDocument per line of the preprocessed file (tag = line number)
corpus = TaggedLineDocument(preprocessed_text_file)
model = Doc2Vec(vector_size=300,
                window=15,
                min_count=50,
                workers=16,
                dm=0,
                alpha=0.75,
                min_alpha=0.001,
                sample=0.00001,
                negative=5)
model.build_vocab(corpus)
model.train(corpus,
            epochs=400,
            total_examples=model.corpus_count,
            start_alpha=0.025,
            end_alpha=0.0001)

Are there any obvious mistakes I am making? Am I using it completely wrong?

I could also try reducing the vector size, but I expect that would give much worse results, since most papers use 300-dimensional vectors.

Simon Hessner

1 Answer


The required model size in addressable memory is largely a function of the number of weights required, which is determined by the number of unique words and unique doc-tags.

With 145,000,000 unique doc-tags, no matter how many words you limit yourself to, just the raw doc-vectors in-training alone will require:

145,000,000 * 300 dimensions * 4 bytes/dimension = 174GB
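
For concreteness, the arithmetic can be checked with a couple of lines of Python (just the rounded figures from above, nothing gensim-specific):

# back-of-the-envelope size of the raw in-training doc-vector array
num_doc_tags = 145_000_000     # roughly one tag per sentence
vector_size = 300              # dimensions per doc-vector
bytes_per_dim = 4              # float32

raw_bytes = num_doc_tags * vector_size * bytes_per_dim
print(raw_bytes / 1e9, "GB")   # -> 174.0 GB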

You could try a smaller data set. You could reduce the vector size. You could get more memory.

I would try one or more of those first, just to verify you're able to get things working and see some initial results.

There is one trick, best considered experimental, that may work to allow training larger sets of doc-vectors, at some cost of extra complexity and lower performance: the docvecs_mapfile parameter of Doc2Vec.

Normally, you don't want a Word2Vec/Doc2Vec-style training session to use any virtual memory, because any recourse to slower disk IO makes training extremely slow. However, for a large doc-set that is only ever iterated over in one order, the performance hit may be survivable if the doc-vectors array is backed by a memory-mapped file. Essentially, each training pass sweeps through the file from front to back, reading each section in once and paging it out once.

If you supply a docvecs_mapfile argument, Doc2Vec will allocate the doc-vectors array to be backed by that on-disk file. So you'll have a hundreds-of-GB file on disk (ideally SSD) whose ranges are paged in/out of RAM as necessary.
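
A minimal sketch of how that option is passed, assuming the gensim 3.x API current at the time of this question; the file paths are placeholders:

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

preprocessed_text_file = 'wiki_sentences.txt'             # placeholder path
corpus = TaggedLineDocument(preprocessed_text_file)

model = Doc2Vec(vector_size=300,
                dm=0,
                min_count=50,
                workers=16,
                docvecs_mapfile='/mnt/ssd/docvecs.mmap')  # doc-vectors array backed by this on-disk file
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=10)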

If you try this, be sure to experiment with this option on small runs first, to familiarize yourself with its operation, especially around saving/loading models.

Note also that if you then ever do a default most_similar() on doc-vectors, another 174GB array of unit-normalized vectors must be created from the raw array. (You can force that to be done in place, clobbering the existing raw values, by explicitly calling init_sims(replace=True) before any other method that requires the unit-normed vectors.)
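
In code, that looks roughly like this (again assuming the gensim 3.x API, where the doc-vectors live under `model.docvecs`):

# Unit-normalize the doc-vectors in place, overwriting the raw values,
# so similarity queries don't allocate a second full-size array.
model.docvecs.init_sims(replace=True)
similar_docs = model.docvecs.most_similar(0, topn=10)    # neighbours of the first line/document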

gojomo
  • Okay, thanks for the detailed explanation. So it seems the main cause for that huge memory requirement is that I have one label per sentence. Do you think it would also work if I train it on whole paragraphs or articles instead of sentences but still use only sentences for inference? My guess would be no, but maybe I am wrong. – Simon Hessner May 17 '18 at 20:39
  • You'd have to try it. It might work well, or even better than shorter sentences. (Doc2Vec aka 'Paragraph Vector' is most often used with multiple-sentence texts.) Separate notes about your setup: • most d2v work uses 10-20 training passes (not the class default of 5, or your `epochs=400`); • your mode, `dm=0`, doesn't use a `window` or train word-vectors (unless you also set `dbow_words=1`); • there's never a reason for a high `alpha=0.75` (even though it's harmless here because a sensible value is used when you call `train()`). – gojomo May 18 '18 at 02:17
  • I trained a few models and evaluated them on the MS paraphrase database. The results are okay, but not extremely good (AUC around 0.7 to 0.75, EER around 0.3, average precision around 0.85). I think training with more data would be better. I will try smaller vectors but a larger training data set now. You said that alpha=0.75 is very high. I trained with alpha=0.025 and min_alpha=0.0001. Is that better? I used 100 epochs, maybe that was too much although I found a paper that uses 400 epochs (https://arxiv.org/pdf/1607.05368.pdf) – Simon Hessner May 22 '18 at 13:19
  • Is there a way to save the model every X epochs without the need to call train in a loop? Maybe using the (not documented) callbacks? – Simon Hessner May 22 '18 at 13:21
  • There's no way to save mid-training unless you use multiple `train()` calls. (If doing so, choose `epochs` and alpha values carefully to have the same overall passes/alpha-decay effect.) 10-20 passes are more common, sometimes more with tiny datasets – the Lau/Baldwin paper is an outlier (and while it has much useful evaluation, other aspects of its approach/writeup seem confused to me – see comments including links to forum posts starting at: https://github.com/RaRe-Technologies/gensim/issues/1270#issuecomment-293437366 ). Yes, 0.025 to 0.0001 is the default/common alpha choice. – gojomo May 22 '18 at 17:23
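
Below is a rough sketch of the multiple-`train()`-calls checkpointing approach described in the comment above, assuming the gensim 3.x API of the time and reusing `model` and `corpus` from the question; the epoch total, checkpoint interval, and file names are illustrative only:

# Checkpoint every few epochs via repeated train() calls (gensim 3.x-era sketch).
# The total passes and the alpha decay are split manually so the overall schedule
# roughly matches a single train(epochs=20, start_alpha=0.025, end_alpha=0.0001).
total_epochs = 20                 # illustrative total, not a recommendation
chunk = 5                         # save a checkpoint every 5 epochs
start_alpha, end_alpha = 0.025, 0.0001
step = (start_alpha - end_alpha) / total_epochs

for done in range(0, total_epochs, chunk):
    model.train(corpus,
                total_examples=model.corpus_count,
                epochs=chunk,
                start_alpha=start_alpha - done * step,
                end_alpha=start_alpha - (done + chunk) * step)
    model.save('doc2vec_checkpoint_epoch_%d.model' % (done + chunk))  # hypothetical filename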