
I am training an LdaMallet model in Python and saving it. I am also saving the training dictionary so that I can build a corpus for unseen documents later. If I perform every action (i.e. train a model, save the trained model, load the saved model, infer an unseen corpus) within the same console, everything works fine. However, I want to use the trained model in a different console / computer.

I passed a prefix while training so I could look at the temp files created by the model. The following files are created when the model is trained:

'corpus.mallet'

'corpus.txt'

'doctopics.txt'

'inferencer.mallet'

'state.mallet.gz'

'topickeys.txt'

Now, when I load the saved model in a different console and infer an unseen corpus created using the saved dictionary, no other temp files are created and it produces the following error:

FileNotFoundError: No such file or directory: 'my_directory\\doctopics.txt.infer'

For some odd reason, if I load the saved model in the same console it was trained in and infer an unseen corpus as above, 'corpus.txt' is updated and two new temp files are created:

'corpus.mallet.infer'

'doctopics.txt.infer'

Any idea why I might be having this issue?

I have tried using LdaModel instead of LdaMallet, and LdaModel works fine regardless of whether I perform the whole task in the same console or a different one.

Below is a snippet of the code I am using.
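
For reference, the snippet assumes imports along these lines (they are not shown in the excerpt):

    # Imports assumed for the snippet below; adjust to your own project.
    import os
    import gensim
    import pandas as pd
    from gensim import corpora
    from gensim.models import CoherenceModel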

    def find_optimum_model(self):
        lemmatized_words = self.lemmatization()
        id2word = corpora.Dictionary(lemmatized_words)
        all_corpus = [id2word.doc2bow(text) for text in lemmatized_words]

        #For the paths below, update with your own path to new_mallet
        os.environ['MALLET_HOME'] = r'C:\\users\\axk0er8\\Sentiment_Analysis_Working\\new_mallet\\mallet-2.0.8'
        mallet_path = r'C:\\users\\axk0er8\\Sentiment_Analysis_Working\\new_mallet\\mallet-2.0.8\\bin\\mallet.bat'
        prefix_path = r'C:\\users\\axk0er8\\Sentiment_Analysis_Working\\new_mallet\\mallet_temp\\'

        def compute_coherence_values(dictionary, all_corpus, texts, limit, start=2, step=4):
            coherence_values = []
            model_list = []
            num_topics_list = []

            for num_topics in range(start, limit, step):
                model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=all_corpus, num_topics=num_topics,
                                                         id2word=dictionary, random_seed=42)
                #model = gensim.models.ldamodel.LdaModel(corpus=all_corpus, num_topics=num_topics, id2word=dictionary,
                #                                        eval_every=1, alpha='auto', random_state=42)
                model_list.append(model)
                coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
                coherence_values.append(coherencemodel.get_coherence())
                num_topics_list.append(num_topics)

            return model_list, coherence_values, num_topics_list

        model_list, coherence_values, num_topics_list = compute_coherence_values(dictionary=id2word, all_corpus=all_corpus,
                                                                                 texts=lemmatized_words, start=5, limit=40, step=6)
        model_values_df = pd.DataFrame({'model_list': model_list, 'coherence_values': coherence_values,
                                        'num_topics': num_topics_list})

        optimal_num_topics = model_values_df.loc[model_values_df['coherence_values'].idxmax()]['num_topics']

        optimal_model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=all_corpus, num_topics=optimal_num_topics,
                                                         id2word=id2word, prefix=prefix_path, random_seed=42)

        #joblib.dump(id2word, 'id2word_dictionary_mallet.pkl')
        #joblib.dump(optimal_model, 'optimal_ldamallet_model.pkl')
        id2word.save('id2word_dictionary.gensim')
        optimal_model.save('optimal_lda_model.gensim')

    def generate_dominant_topic(self):
        lemmatized_words = self.lemmatization()
        id2word = corpora.Dictionary.load('id2word_dictionary.gensim')
        #id2word = joblib.load('id2word_dictionary_mallet.pkl')
        new_corpus = [id2word.doc2bow(text) for text in lemmatized_words]
        optimal_model = gensim.models.wrappers.LdaMallet.load('optimal_lda_model.gensim')
        #optimal_model = joblib.load('optimal_ldamallet_model.pkl')


        def format_topics_sentences(ldamodel, new_corpus):
            sent_topics_df = pd.DataFrame()
            for i, row in enumerate(ldamodel[new_corpus]):
                row = sorted(row, key=lambda x: (x[1]), reverse=True)
                for j, (topic_num, prop_topic) in enumerate(row):
                    if j == 0:
                        wp = ldamodel.show_topic(topic_num)
                        topic_keywords = ", ".join([word for word, prop in wp])
                        sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]),
                                                               ignore_index=True)
                    else:
                        break
            sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
            return (sent_topics_df)

My expectation is to use the find_optimum_model function with the training data and save the optimal model and dictionary. Once saved, use the generate_dominant_topic function to load the saved model and dictionary, create a corpus for the unseen text, and run the model to get the desired topic modeling output.

3 Answers


Despite the name, those aren't actually 'temp' files, as the model needs them to run. I'd strongly suggest you copy them to the new console/machine at the same relative location (prefix) as on the old one, so the model knows where to look for them. Hopefully that will work; I haven't tried it myself, however.

The .infer files appear when you go from training to classifying with the model; I assume they need the inferencer to be built. I've had to delete and retrain my MALLET models many times because these files get corrupted often.
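
For example, something along these lines could carry the prefix directory over to the new machine (a sketch only; the two paths are placeholders, not from the original post):

    import shutil
    from pathlib import Path

    # Placeholder paths: where MALLET wrote its files during training, and the
    # matching prefix directory on the machine that will load the saved model.
    old_prefix_dir = Path('C:/old_machine/mallet_temp')
    new_prefix_dir = Path('C:/new_machine/mallet_temp')
    new_prefix_dir.mkdir(parents=True, exist_ok=True)

    # Copy everything created under the training prefix (corpus.mallet,
    # inferencer.mallet, doctopics.txt, state.mallet.gz, ...) so the loaded
    # model can find it at the same prefix.
    for f in old_prefix_dir.iterdir():
        if f.is_file():
            shutil.copy2(f, new_prefix_dir / f.name)
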
– user108569

Yeah, you will need to bring those files along with you: https://github.com/RaRe-Technologies/gensim/issues/818
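
In short, the files written under the prefix are part of the model's state: the saved .gensim file only records where to find them, so they have to travel with the model. As a quick check after loading, you can print the paths the wrapper expects to exist (a sketch; the method names are from the gensim 3.x LdaMallet wrapper):

    from gensim.models.wrappers import LdaMallet

    model = LdaMallet.load('optimal_lda_model.gensim')
    # All of these paths are derived from model.prefix and must exist on disk.
    for path in (model.fcorpusmallet(), model.fstate(), model.fdoctopics(),
                 model.finferencer(), model.ftopickeys()):
        print(path)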

– Sam

After loading the model, you can specify the new prefix path like so:

    ldamodel.prefix = 'path/to/new/prefix'
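
For example (a sketch; the paths below are placeholders, and the MALLET files produced during training are assumed to have already been copied into the new prefix directory):

    from gensim.models.wrappers import LdaMallet

    # Load the pickled wrapper, then point it at this machine's prefix directory
    # (and at the local MALLET install, if that path changed as well).
    ldamodel = LdaMallet.load('optimal_lda_model.gensim')
    ldamodel.prefix = 'C:/new_machine/mallet_temp/'
    ldamodel.mallet_path = 'C:/new_machine/mallet-2.0.8/bin/mallet.bat'

    # new_corpus: the bag-of-words corpus built with the saved dictionary, as in
    # the question. Inference now writes doctopics.txt.infer under the new prefix.
    new_doc_topics = ldamodel[new_corpus]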