2

I trained a topic model with Mallet and want to serialize it for later use. I ran it on two test documents, then serialized it, deserialized it, and ran the loaded model on the same documents. The results were completely different.

Is there anything wrong with the way I'm saving/loading the model (code attached)?

Thanks!

List<Pipe> pipeList = initPipeList();
// Begin by importing documents from text to feature sequences

InstanceList instances = new InstanceList(new SerialPipes(pipeList));

for (String document : documents) {
    Instance inst = new Instance(document, "","","");
    instances.addThruPipe(inst);
}

ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.addInstances(instances);
model.setNumThreads(numThreads);
model.setNumIterations(numIterations);
model.estimate();

printProbabilities(model, "doc 1"); // I replaced the contents of the docs due to copyright issues
printProbabilities(model, "doc 2");

model.write(new File("model.bin"));
model = ParallelTopicModel.read(new File("model.bin"));

printProbabilities(model, "doc 1");
printProbabilities(model, "doc 2");

Definition of printProbabilities():

public void printProbabilities(ParallelTopicModel model, String doc) {

    List<Pipe> pipeList = initPipeList();

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new Instance(doc, "", "", ""));

    double[] probabilities = model.getInferencer().getSampledDistribution(instances.get(0), 10, 1, 5);

    for (int i = 0; i < probabilities.length; i++) {
        double probability = probabilities[i];
        if (probability > 0.01) {
            System.out.println("Topic " + i + ", probability: " + probability);
        }
    }
}
Stefan Falk
user616254
  • Do you have a specific problem, or are you just looking for a code review? – jonafato Nov 10 '14 at 20:21
  • The problem is that I get different results for the same docs. Before serializing I get: Topic 9, probability: 0.3304651162790718; Topic 60, probability: 0.5025581395348869. After serializing and reloading the model I get: Topic 55, probability: 0.800833333333338; Topic 86, probability: 0.050833333333333626. – user616254 Nov 10 '14 at 20:35

2 Answers

2

You have to use the same pipe for training and for classification. During training, the pipe's data alphabet is updated with each training instance. Calling new SerialPipes(pipeList) again does not reproduce that pipe, because the new pipe's data alphabet starts out empty. Save/load the pipe, or the instance list containing the pipe, along with the model, and use that pipe to add test instances.
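
To make the alphabet issue concrete, here is a minimal sketch using only the Java standard library. ToyAlphabet and encode are hypothetical stand-ins written for illustration, not Mallet classes. They only mimic the idea of assigning each new token the next free index, which is roughly how a growing data alphabet behaves.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a growing data alphabet; NOT Mallet's actual implementation. */
public class AlphabetDemo {

    /** Maps each distinct token to the next free integer index, in order of first appearance. */
    static final class ToyAlphabet {
        private final Map<String, Integer> indexOf = new HashMap<>();

        int lookupIndex(String token) {
            Integer existing = indexOf.get(token);
            if (existing != null) {
                return existing;
            }
            int next = indexOf.size();
            indexOf.put(token, next);
            return next;
        }
    }

    /** Converts whitespace-separated text to feature indices using the given alphabet. */
    static List<Integer> encode(ToyAlphabet alphabet, String text) {
        List<Integer> ids = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            ids.add(alphabet.lookupIndex(token));
        }
        return ids;
    }

    public static void main(String[] args) {
        // "Training" fills the alphabet: topic=0, models=1, map=2, words=3, to=4, topics=5.
        ToyAlphabet trained = new ToyAlphabet();
        encode(trained, "topic models map words to topics");

        String testDoc = "words to topics";
        // Reusing the trained alphabet keeps the indices the model was trained on.
        System.out.println("shared alphabet: " + encode(trained, testDoc));
        // A fresh alphabet numbers the same words from 0 again.
        System.out.println("fresh alphabet:  " + encode(new ToyAlphabet(), testDoc));
    }
}
```

With the shared alphabet the test document encodes to [3, 4, 5]; with a fresh one it encodes to [0, 1, 2]. A model trained against the first numbering would read those indices as entirely different words, which is consistent with the mismatched topic distributions in the question.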

  • Just to make this clear, the code after training to save the instances: instances.save(new File("instances.dat")); and to load the instances before training: InstanceList instances = InstanceList.load(new File("instances.dat")); Hope this helps – c-chavez Jun 05 '17 at 23:26
0

When you don't fix a random seed, every run of Mallet gives you a different topic model (with the numbers of the topics permuted, some topics slightly different, other topics very different).

Fix the random seed to get replicable topics.
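
As a rough illustration of why a fixed seed matters, here is a toy sketch in plain Java. SeedDemo and sampleTopics are hypothetical names, and the code does not use Mallet at all. It only shows that two samplers created with the same seed produce identical draws. In Mallet itself, as far as I know, ParallelTopicModel has a setRandomSeed(int) method for this, but check the API of your version.

```java
import java.util.Arrays;
import java.util.Random;

/** Toy illustration of seeded sampling; NOT Mallet code. */
public class SeedDemo {

    /** Draws `count` topic ids uniformly from [0, numTopics) using the given generator. */
    static int[] sampleTopics(Random random, int numTopics, int count) {
        int[] topics = new int[count];
        for (int i = 0; i < count; i++) {
            topics[i] = random.nextInt(numTopics);
        }
        return topics;
    }

    public static void main(String[] args) {
        // Two independent generators with the same seed yield the same sequence.
        int[] first = sampleTopics(new Random(42L), 100, 5);
        int[] second = sampleTopics(new Random(42L), 100, 5);
        System.out.println("same seed, same draws: " + Arrays.equals(first, second));
    }
}
```

This prints "same seed, same draws: true". Without an explicit seed, each run starts from a different generator state, so any sampling-based estimate can differ from run to run.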

Sir Cornflakes