Mahout LDA: how to predict the topics on a test data set?

Question

From the Apache Mahout website https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html I can see the procedure for fitting an LDA model and outputting the computed topics in the form P("word"|"topic number"). However, there is no information on how the trained model can be applied to test data to predict the topic distribution. Or should we write our own program that uses the output conditional probabilities to find the topics over a test data set?

There is an example in the [cluster-reuters.sh](http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/cluster-reuters.sh) file of LDA topic clustering. You can find it in the examples directory. — Calavoow, Sep 30 '12 at 21:54
@Calavoow, the example you refer to does the training part. I think Rkz wants to get the topic distribution for a new set of documents using the trained model. — Sam, Nov 18 '13 at 18:35

score 0 · Accepted Answer · answered Dec 05 '12 at 02:26

Please have a look at the 2009 paper by Wallach et al. titled 'Evaluation Methods for Topic Models'. Section 4 mentions three methods to calculate P(z|w): one based on importance sampling, and two others called the 'Chib-style estimator' and the 'left-to-right estimator'.

MALLET has an implementation of the left-to-right estimator.
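
If you do end up writing your own program on top of the P(word|topic) output, one simple option is a fold-in style EM that keeps the trained topic-word probabilities fixed and estimates only the topic mixture of each new document. The sketch below is a minimal pure-Python illustration under that assumption. It is not Mahout's or MALLET's API, and it is not one of the estimators from the paper above. The function name, the `alpha` smoothing pseudo-count and the toy probabilities are made up for illustration, and words are assumed to be already mapped to integer ids that match the trained model's vocabulary.

```python
def infer_topic_distribution(doc_word_ids, topic_word_probs, alpha=0.1, iterations=50):
    """Estimate P(topic | document) for one unseen document.

    doc_word_ids     -- list of integer word ids (one entry per word occurrence)
    topic_word_probs -- topic_word_probs[k][w] = P(word w | topic k), held fixed
    alpha            -- hypothetical smoothing pseudo-count added to every topic
    """
    num_topics = len(topic_word_probs)
    # Start from a uniform topic mixture.
    theta = [1.0 / num_topics] * num_topics

    for _ in range(iterations):
        # Expected number of word occurrences assigned to each topic.
        counts = [alpha] * num_topics
        for w in doc_word_ids:
            # E-step: how responsible is each topic for this word occurrence?
            weights = [theta[k] * topic_word_probs[k][w] for k in range(num_topics)]
            total = sum(weights)
            if total == 0.0:
                continue  # word has zero probability under every topic; skip it
            for k in range(num_topics):
                counts[k] += weights[k] / total
        # M-step: renormalise the expected counts into a distribution.
        norm = sum(counts)
        theta = [c / norm for c in counts]

    return theta


if __name__ == "__main__":
    # Toy model: 2 topics over a 4-word vocabulary (made-up numbers).
    toy_model = [
        [0.50, 0.40, 0.05, 0.05],  # topic 0 prefers words 0 and 1
        [0.05, 0.05, 0.40, 0.50],  # topic 1 prefers words 2 and 3
    ]
    new_document = [0, 0, 1, 3]    # mostly words that topic 0 prefers
    print(infer_topic_distribution(new_document, toy_model))
```

Running the function once per test document gives a per-document topic distribution. The estimators in the paper are aimed at evaluating held-out likelihood more carefully, so treat this as a starting point only.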
