LDAvis provides an excellent way of visualising and exploring topic models. LDAvis requires 5 parameters:
- phi (matrix with dimensions number of topics times number of terms)
- theta (matrix with dimensions number of documents times number of topics)
- number of words per document (integer vector)
- the vocabulary (character vector)
- the word frequency in the whole corpus (integer vector)
The short version of my question is: after fitting an LDA model with Vowpal Wabbit, how does one derive phi and theta?
theta represents the mixture of topics per document, and must thus sum to 1 per document. phi represents the probability of a term given the topic, and must thus sum to 1 per topic.
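To make the target concrete, here is a minimal sanity check in Python/numpy (my own sketch, not from the LDAvis docs; the array orientation is my assumption):

    import numpy as np

    def check_distributions(phi, theta, atol=1e-6):
        # phi: topics x terms, theta: documents x topics (assumed orientation)
        assert np.allclose(phi.sum(axis=1), 1.0, atol=atol), "each phi row (topic) must sum to 1"
        assert np.allclose(theta.sum(axis=1), 1.0, atol=atol), "each theta row (document) must sum to 1"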
After running LDA with Vowpal Wabbit (`vw`), some kind of weights are stored in a model file. A human-readable version of that model can be acquired by feeding in a special file, with one document per term in the vocabulary, while deactivating learning (via the `-t` parameter), e.g.:
    vw -t -i weights -d dictionary.vw --readable_model readable.model.txt
According to the Vowpal Wabbit documentation, all columns except the first one of `readable.model.txt` now "represent the per-word topic distributions."
You can also generate predictions with `vw`, e.g. for a collection of documents:
    vw -t -i weights -d some-documents.txt -p predictions.txt
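Analogously, my guess is that theta would come from normalising each row of the predictions, assuming one row of topic weights per document and no extra tag column (again only a sketch under those assumptions):

    import numpy as np

    preds = np.loadtxt("predictions.txt")             # documents x topics (assumed layout)
    theta = preds / preds.sum(axis=1, keepdims=True)  # normalise per document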
Both `predictions.txt` and `readable.model.txt` have dimensions that reflect the number of inputs (rows) and the number of topics (columns), but neither of them contains probability distributions, because the values do not sum to 1 (neither per row nor per column).
I understand that `vw` is not for the faint-hearted and that some programming/scripting will be required on my part, but I'm sure there must be some way to derive theta and phi from the output of `vw`. I've been stuck on this problem for days now; please give me some hints.