I am using the Doc2Vec model from the gensim framework to represent a corpus of 15,500,000 short documents (up to 300 words each):
gensim.models.Doc2Vec(sentences, size=400, window=10, min_count=1, workers=8)
After training, there are more than 18,000,000 vectors representing words and documents.
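For context, a rough back-of-the-envelope estimate of the raw footprint (a sketch using the figures above; gensim stores the vectors as float32, i.e. 4 bytes per value):

import numpy as np

n_vectors = 18000000  # words + documents
dim = 400             # vector size
bytes_per_value = np.dtype(np.float32).itemsize  # 4

raw_gb = n_vectors * dim * bytes_per_value / float(1024 ** 3)
print('syn0 alone: %.1f GB' % raw_gb)  # ~26.8 GB

So the raw vector array alone takes up close to half of the available RAM.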
I want to find the most similar items (words or documents) for a given item:
similarities = model.most_similar('uid_10693076')
but I get a MemoryError when the similarities are computed:
Traceback (most recent call last):
File "article/test_vectors.py", line 31, in <module>
similarities = model.most_similar(item)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 639, in most_similar
self.init_sims()
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 827, in init_sims
self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
MemoryError
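From reading the gensim source, my understanding is that init_sims builds a second, unit-normalized copy of the whole syn0 array (syn0norm), and the line above also allocates full-size temporaries for self.syn0 ** 2 and for the division, so the peak usage during this step is well over twice the ~27 GB of syn0 itself. One thing I considered (a sketch; replace=True overwrites the raw vectors in place, so it is only an option if the model does not need further training):

# Normalize in place instead of keeping syn0 and syn0norm side by side;
# this roughly halves the steady-state memory most_similar() needs.
model.init_sims(replace=True)
similarities = model.most_similar('uid_10693076')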
I have an Ubuntu machine with 60 GB of RAM and 70 GB of swap. I checked the memory allocation in htop and observed that the memory was never completely used. I also set the maximum amount of address space that may be locked into memory (RLIMIT_MEMLOCK) to unlimited from Python:
import resource

resource.setrlimit(resource.RLIMIT_MEMLOCK,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
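For completeness, the limits I looked at (a small sketch; as far as I understand the resource docs, RLIMIT_AS and RLIMIT_DATA are the ones that cap ordinary heap allocations, while RLIMIT_MEMLOCK only affects mlock()ed pages):

import resource

# resource.RLIM_INFINITY (-1) means 'no limit' for the (soft, hard) pair
for name in ('RLIMIT_AS', 'RLIMIT_DATA', 'RLIMIT_MEMLOCK'):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print('%s: soft=%s hard=%s' % (name, soft, hard))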
Could someone explain the reason for this MemoryError? In my opinion the available memory should be enough for these computations. Could there be some memory limit in Python or the OS?
Thanks in advance!