I am trying to run LDA Mallet (through gensim) on a dataset. The function takes lemma tokens and a collection of texts, which is our dataset. The problem is that when we run it, an error mentioning freeze_support() pops up and all of the earlier steps in our script that have already run start running again. The message says this happens because a new process is started before the current one has finished bootstrapping. I'm not sure how to fix it. This is being run on macOS. Code and output are below.
import gensim
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
import os


def optimize_parameters(lemma_tokens, texts):
    os.environ['MALLET_HOME'] = '****/mallet-2.0.8'
    mallet_path = '****/mallet-2.0.8/bin/mallet'
    id2word = Dictionary(lemma_tokens)
    # Filtering extremes
    id2word.filter_extremes(no_below=2, no_above=.99)
    # Creating a corpus object
    corpus = [id2word.doc2bow(d) for d in lemma_tokens]
    model = gensim.models.wrappers.LdaMallet(
        mallet_path, corpus=corpus, num_topics=5, id2word=id2word, workers=4)
    coherencemodel = CoherenceModel(
        model=model, texts=lemma_tokens, dictionary=id2word, coherence='c_v')
    coherence = coherencemodel.get_coherence()
The "****" stands for the rest of the path, which I can't show for privacy reasons.
The error output:
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
<10> LL/token: -6.83952
<20> LL/token: -6.70949
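For reference, the idiom that the error message points to can be sketched as below. This is a minimal illustration, not the actual gensim code: `count_tokens` is a hypothetical stand-in for the real work, and this `optimize_parameters` only mimics a function that starts worker processes. It assumes the script is run directly and that child processes are started with the "spawn" method (the default on macOS since Python 3.8). Under "spawn", each child re-imports the main module, so any call sitting at module level runs again in every child, which would explain why the earlier steps appear to restart.

```python
import multiprocessing


def count_tokens(doc):
    """Hypothetical stand-in for the real per-document work."""
    return len(doc)


def optimize_parameters(lemma_tokens, workers=2):
    """Stand-in for the function in the question: it starts worker processes."""
    with multiprocessing.Pool(processes=workers) as pool:
        return sum(pool.map(count_tokens, lemma_tokens))


if __name__ == "__main__":
    # Only the process launched directly runs this block. Children that
    # re-import this module see a different __name__ and skip it, so the
    # call below is not repeated in every worker.
    lemma_tokens = [["topic", "model"], ["mallet", "gensim", "lda"]]
    print(optimize_parameters(lemma_tokens))
```

In the script from the question, the equivalent change would presumably be to move the call to `optimize_parameters(...)` and the earlier top-level steps under an `if __name__ == '__main__':` guard. `freeze_support()` itself should only matter if the program is frozen into an executable, as the message says.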