I am trying to use LDA Mallet (via gensim's wrapper) to assign my tweets to topics. It works perfectly well when I feed it up to 500,000 tweets, but it seems to stall when I use my whole data set, which is about 2,500,000 tweets. Do you have any solutions for this?
I monitor my CPU and RAM usage while the code runs as one way to confirm it is actually making progress (I use a Jupyter notebook). I use the code below to assign my tweets to topics.
import os
from gensim.models.wrappers import LdaMallet

# Point gensim at the local Mallet installation
os.environ.update({'MALLET_HOME': r'C:/new_mallet/mallet-2.0.8/'})
mallet_path = 'C:/new_mallet/mallet-2.0.8/bin/mallet'

# Train a 10-topic model on the preprocessed corpus
ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=10, id2word=id2word)
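One thing I have been considering (a sketch, not something I have confirmed helps) is passing an explicit prefix so that Mallet's intermediate files are written to a folder I can watch; if those files stop growing, the hang is presumably in the corpus-serialization step rather than in the Java training itself. The scratch folder name here is my own choice:

import os
from gensim.models.wrappers import LdaMallet

# Hypothetical scratch folder; any writable path should work
tmp_dir = 'C:/mallet_tmp/'
os.makedirs(tmp_dir, exist_ok=True)

# prefix controls where the wrapper writes its intermediate files
# (e.g. corpus.txt and corpus.mallet), so their growth can be watched
ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=10,
                      id2word=id2word, prefix=tmp_dir)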
The code seems to work when my data set contains fewer than 500,000 tweets: it spits out the results, and I can see Python and/or Java using my RAM and CPU. However, when I feed it my entire data set, Java and Python show some CPU and RAM usage for the first few seconds, but after that the CPU usage drops below 1 percent and the RAM usage gradually shrinks. I ran the code several times, and after waiting 6-7 hours each time I saw no increase in CPU usage, while the RAM usage kept dropping. The code never produced any results, and I finally had to stop it. Has this happened to you? Do you have any solutions for it? Thank you!
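In case it helps narrow things down, this is roughly the scaling test I plan to run to find where training stops completing (it assumes corpus is an in-memory list of bag-of-words documents; the cutoff sizes are arbitrary):

# Train on progressively larger slices to find where it stalls
for n in (250_000, 500_000, 1_000_000, 2_000_000, len(corpus)):
    subset = corpus[:n]
    print('training on', n, 'documents...')
    model = LdaMallet(mallet_path, corpus=subset, num_topics=10, id2word=id2word)
    print('finished', n, 'documents')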