I am using Mallet from a Scala project. After training the topic model and saving the inferencer file, I tried to assign topics to new texts. The problem is that I get different results depending on how I call the inferencer. Here is what I tried:
Creating a new InstanceList for each document, ingesting just that one document, and getting the topic distribution from that InstanceList:
    somecontentList.map(text => getTopics(text, model))

    def getTopics(text: String, inferencer: TopicInferencer): Array[Double] = {
      val testing = new InstanceList(pipe)
      testing.addThruPipe(new Instance(text, null, "test instance", null))
      inferencer.getSampledDistribution(testing.get(0), iter, 1, burnIn)
    }
Putting everything into a single InstanceList and inferring the topics for all documents together:
    val testing = new InstanceList(pipe)
    somecontentList.foreach(text =>
      testing.addThruPipe(new Instance(text, null, "test instance", null))
    )
    (0 until testing.size).map(i =>
      ldaModel.getSampledDistribution(testing.get(i), 100, 1, 50))
These two methods produce very different results, except for the first instance. What is the right way to use the inferencer?
Additional information: I checked the instance data.
0: topic (0)
1: beaten (1)
2: death (2)
3: examples (3)
4: forum (4)
5: wanted (5)
6: contributing (6)
I assume the number in parentheses is the index of the word that is used in prediction. When I put all the texts into one InstanceList, the indices are different because the collection contains more text. I am not sure how exactly that information is used in the model's prediction process.
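To make the suspicion about the indices concrete, here is a minimal toy sketch of what I think is happening. It assumes the vocabulary hands out indices in first-seen order. `ToyAlphabet`, `AlphabetDemo` and `tokenize` are hypothetical names made up for this illustration; this is not Mallet's own `Alphabet` class, and I have not verified that Mallet behaves the same way.

```scala
import scala.collection.mutable

// Toy stand-in for a growing vocabulary (an assumption, NOT Mallet's Alphabet).
final class ToyAlphabet {
  private val indexOf = mutable.LinkedHashMap.empty[String, Int]

  // Return the word's existing index, or assign the next free index to it.
  def lookup(word: String): Int =
    indexOf.getOrElseUpdate(word, indexOf.size)

  def size: Int = indexOf.size
}

object AlphabetDemo {
  // Hypothetical whitespace tokenizer, used only for this illustration.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Case 1: the document is the only text the vocabulary has seen.
    val alone = new ToyAlphabet
    val aloneIdx = tokenize("forum topic").map(alone.lookup)

    // Case 2: another document was ingested first, so indices shift.
    val shared = new ToyAlphabet
    tokenize("beaten death examples").foreach(shared.lookup)
    val sharedIdx = tokenize("forum topic").map(shared.lookup)

    println(aloneIdx.mkString(","))
    println(sharedIdx.mkString(","))
  }
}
```

In this toy version the same two words get different indices depending on what was ingested before them. If the inferencer looks words up by index, that would only be harmless as long as the indices still match the ones used at training time, which is the part I am unsure about.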