Here's how I infer topic distributions for new documents using MALLET. I thought I would post since I have been looking how to do this and there are a lot of answers, but none of them are comprehensive. This includes the training steps as well so you get an idea of how the different files connect to each other.
Create your training data:
$BIN_DIR/mallet import-file --input $DIRECTORY/data.input --output $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'
where data.input
is a document containing your file ID, label, and a sequence of tokens or token IDs. Then train your model on this data with the parameters you like. For example:
$BIN_DIR/mallet train-topics --input $DIRECTORY/data.mallet \
--num-topics $TOPICS --output-state $DIRECTORY/topic-state.gz \
--output-doc-topics $DIRECTORY/doc-topics.gz \
--output-topic-keys $DIRECTORY/topic-words.gz --num-top-words 500 \
--num-iterations 1000
Later, you can create an inferencer using your trained model and training data:
bin/mallet train-topics --input $DIRECTORY/data.mallet --num-topics NUMBER --input-state $DIRECTORY/topic-state.gz --no-inference --inferencer-filename $DIRECTORY/inferencer-model
Now, create file for new documents using pipe from training data:
bin/mallet import-file --input $DIRECTORY/new_data.input --output $DIRECTORY/new_data.mallet --use-pipe-from $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'
Infer topics on new documents:
bin/mallet infer-topics --inferencer $DIRECTORY/inferencer-model --input $DIRECTORY/new_data.mallet --output-doc-topics $DIRECTORY/new_data_doc_topics --num-iterations 1000