
I'd like to build a large gensim dictionary for the French language, to try to get better results in topic detection, text similarity, and other tasks like that. So I've planned to use a Wikipedia dump and process it the following way:

  1. Extract each article from frwiki-YYYYMMDD-pages-articles.xml.bz2 (Done)
  2. Tokenize each article (basically convert the text to lowercase and remove stop words and non-word characters) (Done)
  3. Train a Phrases model on the articles to detect collocations.
  4. Stem the resulting tokens in each article.
  5. Feed the dictionary with the new corpus (one stemmed, collocated, tokenized article per line). A sketch of steps 4 and 5 follows this list.
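
For context, steps 4 and 5 might look roughly like this: a minimal sketch, assuming NLTK's French Snowball stemmer and gensim's Dictionary, with a placeholder file name:

from gensim.corpora import Dictionary
from nltk.stem.snowball import FrenchStemmer
from smart_open import smart_open

stemmer = FrenchStemmer()

def stemmed_articles(path):
    # Stream one tokenized, collocated article per line, stemming each token.
    with smart_open(path, "r") as corpus:
        for line in corpus:
            yield [stemmer.stem(token) for token in line.split()]

# Dictionary() accepts any iterable of token lists, so the corpus is
# streamed rather than loaded into memory at once.
dictionary = Dictionary(stemmed_articles("frwiki_phrased.txt"))
dictionary.save("frwiki.dict")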

Because of the very large size of the corpus, I don't store anything in memory and access the corpus via smart_open, but it appears the gensim Phrases model consumes too much RAM to complete the third step.

Here is my sample code:

import gensim
from smart_open import smart_open

corpus = smart_open(corpusFile, "r")  # one tokenized article per line
phrases = gensim.models.Phrases()
with smart_open(phrasesFile, "wb") as phrases_file:
    chunks_size = 10000
    texts, i = [], 0
    for text in corpus:
        texts.append(text.split())
        i += 1
        if i % chunks_size == 0:
            phrases.add_vocab(texts)
            texts = []
    if texts:  # don't drop the final partial chunk
        phrases.add_vocab(texts)
    phrases.save(phrases_file)
corpus.close()
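
For reference, Phrases can also consume a restartable iterable directly, without manual chunking. This alone doesn't change the memory profile, since the cost lies in the model's internal vocabulary; a minimal sketch, with hypothetical file names:

from gensim.models.phrases import Phrases
from smart_open import smart_open

class CorpusStream:
    """Restartable iterator yielding one tokenized article (list of tokens) per line."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with smart_open(self.path, "r") as corpus:
            for line in corpus:
                yield line.split()

# Phrases consumes the stream lazily; RAM is still dominated by its internal vocabulary.
phrases = Phrases(CorpusStream("frwiki_tokenized.txt"))
phrases.save("frwiki_phrases.model")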

Is there a way to complete the operation without freezing my computer, or will I have to train the Phrases model on only a subset of my corpus?


1 Answer


I'm answering my own question because I realized I had overlooked some memory-related parameters of the Phrases class.

First, I divided max_vocab_size by 2, so it should consume less memory. I also decided to save the Phrases object every 100,000 articles and reload it from the saved file, as this kind of trick has proven helpful with some other classes in the gensim library.

Here is the new code. It is maybe a little slower, but it completed the task successfully:

from gensim.models.phrases import Phrases
from smart_open import smart_open

corpus = smart_open(corpusFile, "r")  # one tokenized article per line
max_vocab_size = 20000000  # half the default (40M) to cap RAM usage
phrases = Phrases(max_vocab_size=max_vocab_size)
chunks_size = 10000
save_every = 100000
texts, i = [], 0
for text in corpus:
    texts.append(text.split())
    i += 1
    if i % chunks_size == 0:
        phrases.add_vocab(texts)
        texts = []
    if i % save_every == 0:
        # Periodically persist the model and reload it from disk.
        phrases.save(phrasesFile)
        phrases = Phrases.load(phrasesFile)
if texts:  # don't drop the final partial chunk
    phrases.add_vocab(texts)
corpus.close()
phrases.save(phrasesFile)

I ended up with 412,816 phrasegrams in my case, after loading all of this into a Phraser object.
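
That last step might look roughly like this: a minimal sketch using gensim's Phraser class, with a hypothetical file name, and the detected phrasegrams will of course depend on the training data:

from gensim.models.phrases import Phrases, Phraser

phrases = Phrases.load("frwiki_phrases.model")  # trained as above
bigram = Phraser(phrases)  # lighter, frozen version for fast application

# Apply the detected collocations to a tokenized article.
tokens = "la tour eiffel est à paris".split()
print(bigram[tokens])  # hypothetical output: ['la', 'tour_eiffel', 'est', 'à', 'paris']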

    Using `max_vocab_size` will help, by discarding running word/bigram counts of less-frequent items whenever the tallying dictionary hits the configured maximum. (Note, though, that this makes the counts of less-frequent words approximate. And, each trim-action actually shrinks the tallying dictionary to a size far below the `max_vocab_size` – so the lost-data may be larger than you expect, and the final number of items may be far smaller than `max_vocab_size`, rather than just at it.) OTOH, saving/reloading should have no effect (or possibly negative effect) on the overall memory usage. – gojomo Jan 23 '19 at 19:26
  • @gojomo I'm not so sure about the save/reload part: the first save was costly in both time and memory (even though the Phrases object was smaller then!), but the subsequent ones took less time and memory, so in doubt I'd recommend keeping it that way. And if the whole thing has to crash or freeze during the save process, the sooner the better :) – fbparis Jan 24 '19 at 04:20
  • I'm fairly familiar with the source code, and don't see any way a save/reload could save memory, but it definitely could use more. – gojomo Jan 24 '19 at 06:16