
I am trying to load a saved gensim LdaMallet model. It was trained and saved like this:

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=n_topics, id2word=id2word)
ldamallet.save('ldamallet')

When I test this on a new query (with the original corpus and dictionary), everything seems fine the first time the model is loaded.

ques_vec = [dictionary.doc2bow(words) for words in data_words_list]
for i, row in enumerate(lda[ques_vec]):
    row = sorted(row, key=lambda x: (x[1]), reverse=True)

Running the same code again afterward raises this error:

java.io.FileNotFoundException: /tmp/9f371_corpus.mallet (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at cc.mallet.types.InstanceList.load(InstanceList.java:787)
    at cc.mallet.classify.tui.Csv2Vectors.main(Csv2Vectors.java:131)
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't read InstanceList from file /tmp/9f371_corpus.mallet
    at cc.mallet.types.InstanceList.load(InstanceList.java:794)
    at cc.mallet.classify.tui.Csv2Vectors.main(Csv2Vectors.java:131)
Traceback (most recent call last):
  File "topic_modeling1.py", line 406, in <module>
    topic = get_label(text, id2word, first, ldamallet)
  File "topic_modeling1.py", line 237, in get_label
    for i, row in enumerate(lda[ques_vec]):
  File "/home/user/sjha/anaconda3/envs/conda_env/lib/python3.6/site-packages/gensim/models/wrappers/ldamallet.py", line 308, in __getitem__
    self.convert_input(bow, infer=True)
  File "/home/user/sjha/anaconda3/envs/conda_env/lib/python3.6/site-packages/gensim/models/wrappers/ldamallet.py", line 256, in convert_input
    check_output(args=cmd, shell=True)
  File "/home/user/sjha/anaconda3/envs/conda_env/lib/python3.6/site-packages/gensim/utils.py", line 1806, in check_output
    raise error
subprocess.CalledProcessError: Command '/home/user/sjha/projects/topic_modeling/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /tmp/9f371_corpus.txt --output /tmp/9f371_corpus.mallet.infer --use-pipe-from /tmp/9f371_corpus.mallet' returned non-zero exit status 1.

Contents of my /tmp/ directory:

/tmp/9f371_corpus.txt  /tmp/9f371_doctopics.txt /tmp/9f371_doctopics.txt.infer  /tmp/9f371_inferencer.mallet  /tmp/9f371_state.mallet.gz  /tmp/9f371_topickeys.txt
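Comparing that listing with the file named in the error can also be scripted. This is only a minimal sketch: the suffixes are copied from the listing and the error message, the choice of which files inference needs is an assumption based on the failing command, and `missing_files` is a hypothetical helper, not part of gensim.

```python
import os
import tempfile

# Suffixes copied from the /tmp listing and the error message in the question.
TRAINING_FILES = ["corpus.txt", "corpus.mallet", "state.mallet.gz",
                  "doctopics.txt", "inferencer.mallet", "topickeys.txt"]
# Assumption: inference reads at least these two files.
NEEDED_FOR_INFERENCE = ["corpus.mallet", "inferencer.mallet"]


def missing_files(prefix, suffixes):
    """Return every path `prefix + suffix` that does not exist on disk."""
    return [prefix + s for s in suffixes if not os.path.isfile(prefix + s)]


if __name__ == "__main__":
    # Prefix taken from the paths in the question; adjust for another model.
    prefix = os.path.join(tempfile.gettempdir(), "9f371_")
    for path in missing_files(prefix, NEEDED_FOR_INFERENCE):
        print("missing:", path)
```

Going by the listing above, corpus.mallet is the file such a check would flag, which matches the FileNotFoundException.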

Also, the files /tmp/9f371_doctopics.txt.infer and /tmp/9f371_corpus.txt seem to be modified every time I load the model. What could be causing this error? Is it a bug in gensim's Mallet wrapper?

Saurav--
  • Any progress on this @saurav? I have the same issue! – him229 Apr 24 '19 at 05:07
  • The code in this question solved my problem: https://stackoverflow.com/questions/55091094/correct-way-to-load-ldamallet-model-with-gensim-and-classify-unseen-documents – user2409194 Aug 20 '20 at 10:15

1 Answer


The gensim Mallet wrapper stores important model files (the corpus, etc.) in /tmp if prefix is unset. When /tmp is cleared (say, by restarting), inference fails because it needs those files to run. Deleting the model and rerunning the algorithm does not solve it; you must first reinstall gensim.

For example:

conda uninstall gensim
conda install gensim

Or use whatever package manager you prefer. Then delete your saved models (sorry, their corpus files etc. are already gone).

Important: before rerunning, explicitly set the prefix parameter when initializing LdaMallet:

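import os
from gensim.models.wrappers import LdaMallet  # gensim 3.x; the wrappers module was removed in gensim 4.0

prefix_dir = "path/to/prefix_dir"  # placeholder: your chosen directory, somewhere that is not cleared on reboot
os.makedirs(prefix_dir, exist_ok=True)
prefix = os.path.join(prefix_dir, "")  # trailing separator, because the prefix is prepended to each file name as-is
ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=n_topics, id2word=id2word, prefix=prefix)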
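One detail worth checking when choosing a prefix: the paths in the question (/tmp/9f371_corpus.mallet and so on) suggest the wrapper forms each file name by appending a fixed suffix to the prefix string. Under that assumption, a bare directory name without a trailing separator would put the files next to the directory rather than inside it. A minimal sketch of building a prefix that lands inside the directory; `make_mallet_prefix` and the directory name are hypothetical:

```python
import os


def make_mallet_prefix(directory, name="model_"):
    """Create `directory` if needed and return a prefix string inside it.

    Assumes file names are built as prefix + "corpus.mallet" etc.,
    as the /tmp/9f371_corpus.mallet paths in the question suggest.
    """
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)


if __name__ == "__main__":
    # Hypothetical directory name; pick one that survives a reboot.
    prefix = make_mallet_prefix("mallet_files")
    print(prefix + "corpus.mallet")
```

The returned string can then be passed as prefix=... when the model is trained, so the files are written to a location that is not cleaned up automatically.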
user108569