
I am using the doc2vec model from the gensim framework to represent a corpus of 15 500 000 short documents (up to 300 words each):

gensim.models.Doc2Vec(sentences, size=400, window=10, min_count=1, workers=8)

After training, there are more than 18 000 000 vectors representing words and documents.

I want to find the most similar items (words or documents) for a given item:

    similarities = model.most_similar('uid_10693076')

but I get a MemoryError when the similarities are computed:

Traceback (most recent call last):
  File "article/test_vectors.py", line 31, in <module>
    similarities = model.most_similar(item)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 639, in most_similar
    self.init_sims()
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 827, in init_sims
    self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
MemoryError

I have an Ubuntu machine with 60GB of RAM and 70GB of swap. I checked the memory allocation (in htop) and observed that the memory was never completely used. I also set the maximum amount of address space that may be locked in memory to unlimited in Python:

resource.getrlimit(resource.RLIMIT_MEMLOCK)
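
For reference, a minimal sketch of reading and raising that limit with the resource module; getrlimit() only reports the current soft/hard limits, while setrlimit() is the call that actually changes them:

    import resource

    # Read the current soft/hard limits for locked memory
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print soft, hard

    # Raise both to "unlimited" (raising the hard limit usually requires root)
    resource.setrlimit(resource.RLIMIT_MEMLOCK,
                       (resource.RLIM_INFINITY, resource.RLIM_INFINITY))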

Could someone explain the reason for this MemoryError? In my opinion the available memory should be enough for this computation. Could there be some memory limits imposed by Python or the OS?

Thanks in advance!

1 Answer


18M vectors * 400 dimensions * 4 bytes/float = 28.8GB for the model's syn0 array (trained vectors)

The syn1 array (hidden weights) will also be 28.8GB – even though syn1 doesn't really need entries for doc-vectors, which are never target-predictions during training.

The vocabulary structures (vocab dict and index2word table) will likely add another GB or more. So that's all your 60GB RAM.

The syn0norm array, used for similarity calculations, will need another 28.8GB, for a total usage of around 90GB. It's the syn0norm creation where you're getting the error. But even if syn0norm creation succeeded, being that deep into virtual memory would likely ruin performance.
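
A quick back-of-the-envelope check of those numbers:

    # Rough arithmetic behind the sizes above (float32 = 4 bytes per value)
    n_vectors = 18000000      # word + document vectors
    dims = 400

    syn0_gb = n_vectors * dims * 4 / 1e9    # trained vectors: ~28.8 GB
    syn1_gb = syn0_gb                       # hidden-layer weights: ~28.8 GB
    syn0norm_gb = syn0_gb                   # normalized copy from init_sims(): ~28.8 GB

    print syn0_gb + syn1_gb + syn0norm_gb   # ~86 GB, before vocab dicts & overhead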

Some steps that might help (a combined sketch follows this list):

  • Use a min_count of at least 2: words appearing once are unlikely to contribute much, but likely use a lot of memory. (But since words are a tiny portion of your syn0, this will only save a little.)

  • After training but before triggering init_sims(), discard the syn1 array. You won't be able to train further, but your existing word/doc vectors remain accessible.

  • After training but before calling most_similar(), call init_sims() yourself with a replace=True parameter, to discard the non-normalized syn0 and replace it with the syn0norm. Again you won't be able to train more, but you'll save the syn0 memory.
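
Putting those together, a rough sketch of the post-training sequence (using the same old-style API as in the question, where the arrays live directly on the model object, and the question's sentences corpus and uid tag):

    import gensim

    # Sketch only: min_count raised, syn1 dropped, syn0 replaced by its normalized form
    model = gensim.models.Doc2Vec(sentences, size=400, window=10,
                                  min_count=2, workers=8)

    del model.syn1                 # no further training possible after this
    model.init_sims(replace=True)  # overwrite syn0 with unit-normalized vectors

    similarities = model.most_similar('uid_10693076')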

In-progress work separating out the doc and word vectors, which will appear in gensim versions after 0.11.1, should also eventually offer some relief. (It'll shrink syn1 to include only word entries, and allow doc-vectors to come from a file-backed (memmap'd) array.)
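
To illustrate the general idea of a file-backed array (this is not gensim's actual implementation, just a numpy.memmap sketch):

    import numpy as np

    # Vectors live in a file on disk; the OS pages slices in and out on demand,
    # so the full 15.5M x 400 array never has to fit in RAM at once.
    doc_vectors = np.memmap('docvecs.dat', dtype='float32', mode='w+',
                            shape=(15500000, 400))
    doc_vectors[12345] = 0.1    # writes go through to the backing file
    doc_vectors.flush()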

gojomo
  • I did what you proposed: I set min_count = 2 and eliminated the syn1 array, and it worked :). However, could you explain why, if I have 60GB RAM and 70GB swap (130GB in total), the system fails when it needs around 90GB of memory? Does the gensim word2vec implementation need to fit all the information in RAM? Thank you! – Silvia Necsulescu Jun 16 '15 at 15:05
  • Gensim's word2vec needs its structures in addressable space, but not necessarily RAM. Still, you never want to rely on swap for actively, randomly accessed big data. 60GB RAM for syn0 & syn1; 30GB for syn0norm (but see the new third bullet above for another recommendation) – then most_similar() does an array-distance to every vector, temporarily using another 30GB (though I think some upcoming fixes will allow doing that in smaller batches). That's 120GB before considering any other memory use, fragmentation, or inefficiency. Even if it worked, the swapping would likely make performance awful. – gojomo Jun 16 '15 at 22:54
  • How can I eliminate the syn1 array? – Dimmy Magalhães Aug 27 '18 at 20:08
  • As with any other Python object property, you can delete it with `del`, for example `del model.syn1`. (Or, in the more common case where negative sampling is being used and thus only `syn1neg` exists, `del model.syn1neg`.) But as noted, this destroys an important part of the model, and is only appropriate if you'll only ever be looking up the word/doc vectors (never re-training or inferring). And in that case, recent versions of gensim let you save/load/operate on just the sets of vectors. (The answer above is over 3 years old.) – gojomo Aug 27 '18 at 21:30