I'm using MALLET to train a ParallelTopicModel. After training, I get a TopicInferencer, run the same sentence through it 15 times, and compare the results. For some topics, the estimated probability is different on every run — the values are not consistent at all.
For example, with 20 topics, these are the estimated topic distributions I get for the same sentence across runs:
[0.004888044738437717, 0.2961123293878907, 0.0023192114841146965, 0.003828168015645214, 0.3838058036596986, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.26812948669964976, 0.0023192114841146965, 0.0038281680156452146, 0.35582296097145744, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.052283368409032215, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.003828168015645214, 0.3931334178891125, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839043, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.09765976509353493, 0.03773855412711243, 0.007213888668919175, 0.0029028156321696105, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.3278401182832166, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.06967692240529397, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.0038281680156452146, 0.5143924028714901, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412126, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.014972911491377543, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.20283618709375414, 0.0023192114841146965, 0.0038281680156452146, 0.29985727559497544, 0.0023130490636768045, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.11631499355236223, 0.028410939897698752, 0.007213888668919175, 0.002902815632169611, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437716, 0.43602654282909553, 0.0023192114841146965, 0.0038281680156452146, 0.2998572755949755, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.03773855412711241, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.07224958788196291, 0.0023192114841146965, 0.0038281680156452146, 0.3278401182832165, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412129, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.04295575417961857, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.4490991032655942, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.07093859686785953, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.24014664401140884, 0.0023192114841146965, 0.0038281680156452146, 0.26254681867732077, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.06967692240529395, 0.05639378258593975, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.3744781894302849, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.06967692240529398, 0.047066168356526085, 0.007213888668919175, 0.002902815632169611, 0.06161098263844586, 0.0085078656328731, 0.0071022047541209835, 0.012203497697416594]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.0038281680156452146, 0.32784011828321646, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.10956004479961755, 0.0023192114841146965, 0.0038281680156452146, 0.3838058036596989, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.11631499355236223, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.25880187247023617, 0.0023192114841146965, 0.0038281680156452146, 0.28120204713614816, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.03773855412711241, 0.007213888668919175, 0.002902815632169611, 0.08959382532668683, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437716, 0.2214914155525815, 0.0023192114841146965, 0.0038281680156452146, 0.37447818943028494, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.07900453663470762, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.007102204754120983, 0.0028758834680029416]
As you can see, a few columns (e.g. topics 1, 4, 12, and 16) vary considerably between runs, while the rest are stable. Why does this happen, and is there a way to prevent it? I'm feeding the distribution as features into another machine learning model, and these inconsistencies throw that model off.
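To quantify which topics are unstable, I compute the per-topic standard deviation across the repeated inference runs (plain Java, no MALLET dependency; the `runs` list is assumed to hold one `double[]` per call to `getSampledDistribution`):

```java
import java.util.List;

public class TopicStability {
    // Per-topic standard deviation across repeated inference runs.
    // runs: one double[numTopics] array per inference call on the same instance.
    static double[] perTopicStdDev(List<double[]> runs) {
        int numTopics = runs.get(0).length;
        double[] mean = new double[numTopics];
        for (double[] run : runs) {
            for (int t = 0; t < numTopics; t++) {
                mean[t] += run[t] / runs.size();
            }
        }
        double[] sd = new double[numTopics];
        for (double[] run : runs) {
            for (int t = 0; t < numTopics; t++) {
                double d = run[t] - mean[t];
                sd[t] += d * d / runs.size();
            }
        }
        for (int t = 0; t < numTopics; t++) {
            sd[t] = Math.sqrt(sd[t]);
        }
        return sd;
    }

    public static void main(String[] args) {
        // Toy data shaped like the first three columns of the output above.
        List<double[]> runs = List.of(
                new double[]{0.3000, 0.3838, 0.0023},
                new double[]{0.2681, 0.3558, 0.0023},
                new double[]{0.2588, 0.5144, 0.0023});
        double[] sd = perTopicStdDev(runs);
        // Stable topics show near-zero spread; the noisy ones stand out.
        System.out.printf("%.4f %.4f %.4f%n", sd[0], sd[1], sd[2]);
    }
}
```

Running this over the 15 rows above makes the pattern obvious: most topics have spread near zero, but a handful have a standard deviation on the same order as the probability itself.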
My code:
// Train the model; the seed is fixed so that training itself is reproducible.
ParallelTopicModel ldaModel = new ParallelTopicModel(numTopics, alphaSum, beta);

InstanceList instances = new InstanceList(new SerialPipes(pipeList));
for (String trainPt : data) {
    instances.addThruPipe(new Instance(trainPt, null, null, trainPt));
}

ldaModel.addInstances(instances);
ldaModel.setNumThreads(numThreads);
ldaModel.setNumIterations(numIterations);

TopicInferencer inferencer = null;
try {
    ldaModel.setRandomSeed(DEFAULT_SEED);
    ldaModel.estimate();  // declared to throw IOException
    inferencer = ldaModel.getInferencer();
} catch (IOException e) {
    System.out.println(e);
}

// Infer a topic distribution for a single held-out sentence.
String dataPt = "This is a test sentence.";
Instance dataPtInstance = new Instance(dataPt, null, null, dataPt);
InstanceList testList = new InstanceList(new SerialPipes(pipeList));
testList.addThruPipe(dataPtInstance);
double[] prob = inferencer.getSampledDistribution(
        testList.get(0), testIterations, thinIterations, burnInIterations);
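One workaround I'm experimenting with (a sketch, not a confirmed fix) is to call `getSampledDistribution` several times on the same instance and average the results, on the assumption that the per-run variation is Gibbs-sampling noise that a mean smooths out. The averaging helper below is plain Java; the inferencer loop that would produce `samples` is assumed:

```java
import java.util.Arrays;

public class TopicAverager {
    // Element-wise mean of several sampled topic distributions,
    // e.g. 15 calls to inferencer.getSampledDistribution(...) on one instance.
    static double[] averageDistributions(double[][] samples) {
        double[] avg = new double[samples[0].length];
        for (double[] sample : samples) {
            for (int t = 0; t < avg.length; t++) {
                avg[t] += sample[t] / samples.length;
            }
        }
        return avg;
    }

    public static void main(String[] args) {
        double[][] samples = {
                {0.30, 0.68, 0.02},
                {0.26, 0.70, 0.04}};
        // The average of distributions that each sum to 1 also sums to 1.
        System.out.println(Arrays.toString(averageDistributions(samples)));
    }
}
```

This stabilizes the features, but I'd still like to understand whether the inferencer can be made deterministic directly instead of papering over the randomness.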