In my project, I use the Python library gensim for topic modeling of text. I am trying to load my trained LdaMallet model to classify new, unseen texts.
The first part loads the model:
import os
import gensim.models.wrappers.ldamallet  # the wrappers module ships with gensim 3.x

dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'mallet-2.0.8/bin/mallet')
# Download: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
os.environ['MALLET_HOME'] = 'path/to/mallet'  # placeholder: path to mallet
# Load the trained Mallet model, then convert it to a regular gensim LdaModel
ldaMallet = gensim.models.wrappers.LdaMallet.load('lda_malletoutputCommentsAndMethods.model')
ldaModel = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldaMallet)
I am not sure about the last line, which converts the LdaMallet model to an LdaModel. It was the only way I could get any result.
The second part prepares the new data and classifies it:
from gensim.test.utils import common_dictionary
other_texts = [['new', 'document', 'to', 'classify', 'as', 'array']]
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
vector = ldaModel[other_corpus[0]]
# sorts the result by probability and not by topic ID
print(sorted(vector, key=lambda x: x[1], reverse=True))
Then the result looks something like this:
[(16, 0.143), (17, 0.08), (9, 0.0653),...]
No matter which text I put in the other_texts array, this result does not change, but it should.
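One thing that may be worth checking is what the bag-of-words step produces for the new text. The sketch below is plain Python with no gensim dependency; `toy_doc2bow` and the toy `token2id` mapping are hypothetical stand-ins I made up for illustration, written under the assumption that a dictionary lookup such as `doc2bow` silently drops tokens it does not know. If that assumption holds, two very different texts can both end up as the same empty vector.

```python
from collections import Counter


def toy_doc2bow(tokens, token2id):
    """Hypothetical stand-in for a dictionary lookup:
    count tokens that are in the vocabulary and drop all others."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy vocabulary (an assumption for this sketch); a real one would be
# the dictionary the model was trained with.
token2id = {"human": 0, "interface": 1, "computer": 2}

text_a = ["new", "document", "to", "classify", "as", "array"]
text_b = ["completely", "different", "words"]
text_c = ["human", "computer", "computer"]

bow_a = toy_doc2bow(text_a, token2id)
bow_b = toy_doc2bow(text_b, token2id)
bow_c = toy_doc2bow(text_c, token2id)

print(bow_a)  # [] because none of the tokens are in the vocabulary
print(bow_b)  # [] for the same reason, so it is identical to bow_a
print(bow_c)  # [(0, 1), (2, 2)]
```

In my code, the analogous check would be to print other_corpus[0] before passing it to ldaModel. If it is empty for every input, the model sees the same empty document each time, which would be consistent with the unchanging result.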