I am trying to use LDA Mallet (via gensim's wrapper) to assign my tweets to topics. It works perfectly well when I feed it up to 500,000 tweets, but it seems to stall when I use my whole data set, which is about 2,500,000 tweets. Do you have any solutions for this?
I monitor my CPU and RAM usage while the code runs as one way to confirm it is actually making progress (I use a Jupyter notebook). I use the code below to assign my tweets to topics.
import os
from gensim.models.wrappers import LdaMallet

# Point gensim at the local Mallet installation
os.environ.update({'MALLET_HOME': r'C:/new_mallet/mallet-2.0.8/'})
mallet_path = 'C:/new_mallet/mallet-2.0.8/bin/mallet'

# Train a 10-topic model on the preprocessed corpus
ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=10, id2word=id2word)
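One thing I have been considering (a sketch, not something I have confirmed helps) is passing an explicit prefix so that Mallet's intermediate files are written to a folder I can watch; if those files stop growing, the hang is presumably in the corpus-serialization step rather than in the Java training itself. The scratch folder name here is my own choice:

import os
from gensim.models.wrappers import LdaMallet

# Hypothetical scratch folder; any writable path should work
tmp_dir = 'C:/mallet_tmp/'
os.makedirs(tmp_dir, exist_ok=True)

# prefix controls where the wrapper writes its intermediate files
# (e.g. corpus.txt and corpus.mallet), so their growth can be watched
ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=10,
                      id2word=id2word, prefix=tmp_dir)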
The code seems to work when my data set contains fewer than 500,000 tweets: it spits out the results, and I can see Python and/or Java using my RAM and CPU. However, when I feed it my entire data set, Java and Python show some CPU and RAM usage for the first few seconds, but after that the CPU usage drops below 1 percent and the RAM usage gradually shrinks. I ran the code several times, and after waiting 6-7 hours each time I saw no increase in CPU usage, while the RAM usage kept dropping. The code never produced any results, and I finally had to stop it. Has this happened to you? Do you have any solutions for it? Thank you!
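In case it helps narrow things down, this is roughly the scaling test I plan to run to find where training stops completing (it assumes corpus is an in-memory list of bag-of-words documents; the cutoff sizes are arbitrary):

# Train on progressively larger slices to find where it stalls
for n in (250_000, 500_000, 1_000_000, 2_000_000, len(corpus)):
    subset = corpus[:n]
    print('training on', n, 'documents...')
    model = LdaMallet(mallet_path, corpus=subset, num_topics=10, id2word=id2word)
    print('finished', n, 'documents')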