I extracted 145,185,965 sentences (14 GB) from the English Wikipedia dump and want to train a Doc2Vec model on them. Unfortunately I have 'only' 32 GB of RAM and get a MemoryError when I try to train. Even with min_count set to 50, gensim tells me it would need over 150 GB of RAM. I don't think further increasing min_count is a good idea, because the resulting model would probably be much worse (just a guess), but I will try it with 500 anyway to see whether memory is sufficient then.
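For what it's worth, a rough back-of-envelope calculation of my own (assuming gensim keeps one float32 vector per tagged line, i.e. per sentence) already lands in that range, and if I understand correctly min_count only shrinks the word vocabulary, not this per-document array:

# rough estimate of the doctag vector array alone, assuming float32 (4 bytes per value)
n_docs = 145_185_965   # one doctag per line with TaggedLineDocument
vector_size = 300
print(n_docs * vector_size * 4 / 1024**3)  # ~162 GiB, before word vectors and hidden-layer weights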
Is there any way to train such a large model with limited RAM?
Here is my current code:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# every line of the preprocessed file becomes one tagged document
corpus = TaggedLineDocument(preprocessed_text_file)

model = Doc2Vec(vector_size=300,
                window=15,
                min_count=50,
                workers=16,
                dm=0,          # PV-DBOW
                alpha=0.75,
                min_alpha=0.001,
                sample=0.00001,
                negative=5)

# scans the corpus and allocates the vector arrays
model.build_vocab(corpus)

model.train(corpus,
            epochs=400,
            total_examples=model.corpus_count,
            start_alpha=0.025,
            end_alpha=0.0001)
Am I making some obvious mistakes here, or using it completely wrong?
I could also try reducing the vector size, but I suspect the results would be much worse, since most papers use 300-dimensional vectors.
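Besides, going by the back-of-envelope numbers above, memory should only scale roughly linearly with the vector size, e.g.:

print(145_185_965 * 100 * 4 / 1024**3)  # ~54 GiB for 100-d doctag vectors, still well over my 32 GB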