
I am using MALLET from a Scala project. After training a topic model and getting the inferencer file, I tried to assign topics to new texts. The problem is that I get different results depending on how I call the inferencer. Here are the things I tried:

  1. Creating a new InstanceList, ingesting just one document, and getting the topic distribution from that single-instance InstanceList:

    somecontentList.map(text => getTopics(text, model))

    def getTopics(text: String, inferencer: TopicInferencer): Array[Double] = {
      val testing = new InstanceList(pipe)
      testing.addThruPipe(new Instance(text, null, "test instance", null))
      // iter sampling iterations, thinning of 1, burnIn burn-in steps
      inferencer.getSampledDistribution(testing.get(0), iter, 1, burnIn)
    }
    
  2. Putting everything into one InstanceList and predicting topics together:

    val testing = new InstanceList(pipe)
    somecontentList.foreach(text =>
      testing.addThruPipe(new Instance(text, null, "test instance", null))
    )
    // ldaModel must be the TopicInferencer here: getSampledDistribution
    // is defined on TopicInferencer, not on the topic model itself
    (0 until testing.size).map(i =>
      ldaModel.getSampledDistribution(testing.get(i), 100, 1, 50))
    

These two methods produce very different results, except for the first instance. What is the right way to use the inferencer?

Additional information: I checked the instance data.

0: topic (0)
1: beaten (1)
2: death (2)
3: examples (3)
4: forum (4)
5: wanted (5)
6: contributing (6)

I assume the number in parentheses is the index of the word as used in prediction. When I put all of the texts into one InstanceList, the indices are different because the collection contains more text. I am not sure how exactly that information is used in the prediction process.
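As far as I understand, MALLET's `Alphabet` hands out indices in order of first appearance, so alphabets built from different collections will generally disagree. A minimal sketch of that behavior (using the words from the dump above):

    import cc.mallet.types.Alphabet

    // Every unseen entry gets the next free index, so two alphabets
    // built from different document collections assign different ids.
    val a = new Alphabet()
    a.lookupIndex("topic")   // 0
    a.lookupIndex("beaten")  // 1
    a.lookupIndex("topic")   // still 0: the entry already exists

    val b = new Alphabet()
    b.lookupIndex("beaten")  // 0 here, not 1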


2 Answers


Remember that new instances must be imported with the pipe from the original data, as recorded in the Inferencer, in order for the alphabets to match. It's not clear where `pipe` comes from in the Scala code, but the fact that the first seven words have what look like ids starting from 0 suggests that this is a new alphabet.
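If the training instances were saved to disk, loading them back gives access to the original pipe. A minimal sketch from Scala (the file names and the `text` value below are placeholders, not part of the question):

    import java.io.File
    import cc.mallet.types.{Instance, InstanceList}
    import cc.mallet.topics.TopicInferencer

    // Reuse the pipe from the serialized training instances so that new
    // documents are mapped through the alphabet the model was trained on.
    val training   = InstanceList.load(new File("training.mallet"))
    val pipe       = training.getPipe
    val inferencer = TopicInferencer.read(new File("inferencer.mallet"))

    val text    = "some new document"
    val testing = new InstanceList(pipe)
    testing.addThruPipe(new Instance(text, null, "test instance", null))
    val dist = inferencer.getSampledDistribution(testing.get(0), 100, 1, 50)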

David Mimno
  • That's probably the reason. But how do I use the original pipe? There is no method on `TopicInferencer` to get the original pipe, so I guess the pipe information is not stored in the inferencer file. Do I have to store the alphabets separately, or store the whole model? It is quite large and inefficient to serve as a real-time service. When I inspect the inferencer file, it looks like words are stored in it; I thought it would be word-to-topic probabilities. Should the model be self-sufficient? – yang Sep 17 '18 at 16:14
  • The inferencer has the alphabet mapping strings to ints, but not the original pipe. If you have the original training instances file, you can use that. I'm not sure what the best way through the API from Scala would be. – David Mimno Sep 18 '18 at 14:03

I ran into a similar issue, although with the R plugin. We ended up calling the inferencer for each row/document separately.

However, there will be some differences in the inferred distributions when you call it repeatedly on the same row, because of the stochasticity of the sampling, although I agree the differences should be small. A sketch of one way to reduce the run-to-run variation follows.
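If run-to-run stability matters, one option (if I remember the API correctly; the file names and parameter values below are placeholders) is to fix the sampler's seed before each call and use more iterations:

    import java.io.File
    import cc.mallet.types.InstanceList
    import cc.mallet.topics.TopicInferencer

    // Fixing the random seed should make repeated calls on the same
    // document reproducible; longer sampling and burn-in tighten the
    // estimate.
    val inferencer = TopicInferencer.read(new File("inferencer.mallet"))
    val instance   = InstanceList.load(new File("testing.mallet")).get(0)

    inferencer.setRandomSeed(42)
    val first = inferencer.getSampledDistribution(instance, 500, 1, 100)

    inferencer.setRandomSeed(42)
    val second = inferencer.getSampledDistribution(instance, 500, 1, 100)
    // first and second should now match element for element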