I'd like to create a large gensim dictionary for the French language to get better results in topic detection, text similarity, and other tasks like that. So I plan to use a Wikipedia dump and process it the following way:
- Extract each article from frwiki-YYYYMMDD-pages-articles.xml.bz2 (Done)
- Tokenize each article (basically lowercasing the text and removing stop words and non-word characters) (Done; see the sketch after this list)
- Train a Phrases model on the articles to detect collocations.
- Stem the resulting tokens in each article.
- Feed the dictionary with the new corpus (one stemmed, collocated, tokenized article per line)
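For reference, here is roughly what the tokenization (step 2) and stemming (step 4) look like. This is just a minimal sketch assuming NLTK's French stop-word list and Snowball stemmer; the tokenize/stem names are illustrative:

import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

french_stopwords = set(stopwords.words("french"))
stemmer = SnowballStemmer("french")

def tokenize(text):
    # Lowercase, keep word characters only (accented letters included), drop stop words
    return [t for t in re.findall(r"\w+", text.lower()) if t not in french_stopwords]

def stem(tokens):
    return [stemmer.stem(t) for t in tokens]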
Because of the very large size of the corpus, I don't store anything in memory and access the corpus via smart_open, but it appears the gensim Phrases model consumes too much RAM to complete the third step.
Here is my sample code:
import gensim
from smart_open import open as smart_open

phrases = gensim.models.Phrases()

chunk_size = 10000
texts = []
with smart_open(corpusFile, "r") as corpus:
    for i, text in enumerate(corpus, start=1):
        texts.append(text.split())
        # Feed the model one chunk at a time so the token lists don't pile up in RAM
        if i % chunk_size == 0:
            phrases.add_vocab(texts)
            texts = []
# Don't drop the last partial chunk
if texts:
    phrases.add_vocab(texts)

phrases.save(phrasesFile)
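Once that model is saved, my plan for steps 4-6 is roughly the following (a sketch reusing the stem helper above; the file names are placeholders):

# Apply the trained Phrases model, stem the result, and grow the dictionary
bigrams = gensim.models.Phrases.load(phrasesFile)
dictionary = gensim.corpora.Dictionary()

with smart_open(corpusFile, "r") as corpus:
    for text in corpus:
        tokens = bigrams[text.split()]            # join detected collocations
        dictionary.add_documents([stem(tokens)])  # one stemmed article at a time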
Is there a way to complete the operation without freezing my computer, or will I have to train the Phrases model on only a subset of my corpus?