
According to the MALLET documentation, it's possible to train topic models incrementally:

"-output-model [FILENAME] This option specifies a file to write a serialized MALLET topic trainer object. This type of output is appropriate for pausing and restarting training"

I'd like to train topics on one dataset and then update the model with a second dataset. After both training steps, I'd like to output the states for both datasets (with --output-state). Here is what I tried:

```shell
# training on the first dataset
../mallet-2.0.7/bin/mallet import-dir --input input/ --keep-sequence --output input.mallet
../mallet-2.0.7/bin/mallet train-topics --input input.mallet --num-topics 3 --output-state topic-state.gz --output-model model

# training on the second dataset
../mallet-2.0.7/bin/mallet import-dir --input input2/ --keep-sequence --output input2.mallet --use-pipe-from input.mallet
../mallet-2.0.7/bin/mallet train-topics --input input2.mallet --num-topics 3 --num-iterations 100 --output-state topic-state2.gz --input-model model
```

In the last command, if I add "--input-model model", the data from the second dataset is missing from the output-state file. If I omit it, the data from the first dataset is missing from the output-state file.

If I instead try to add further instances to an existing model in code:

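```java
// First batch: train on the initial instances
model.addInstances(instances);
model.setNumThreads(2);
model.setNumIterations(50);
model.estimate();

// [...]

// Second batch: add more instances to the same model and train again
model.addInstances(instances2);
model.setNumThreads(2);
model.setNumIterations(50);
model.estimate();
```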

I get an error:

```
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 30
    at cc.mallet.topics.ParallelTopicModel.buildInitialTypeTopicCounts(ParallelTopicModel.java:364)
    at cc.mallet.topics.ParallelTopicModel.addInstances(ParallelTopicModel.java:276)
    at cc.mallet.examples.TopicModel2.main(TopicModel2.java:66)
```
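
One possible reading of this trace, offered as an assumption rather than something confirmed from the MALLET source: count arrays get sized while the first batch is added, and a second batch that needs more room then indexes past the end. The toy class below is not MALLET code. `BatchCountsSketch`, `addBatchNaive`, and `addBatchGrowing` are hypothetical names that only illustrate this general failure mode and a guard against it.

```java
import java.util.Arrays;

/** Toy stand-in for a model that keeps per-word-type counts. Not MALLET code. */
final class BatchCountsSketch {
    private int[] typeCounts = new int[0];

    /** Naive version: sizes the array only once, from the first non-empty batch. */
    void addBatchNaive(int[] typeIds) {
        if (typeCounts.length == 0) {
            typeCounts = new int[maxId(typeIds) + 1];
        }
        for (int id : typeIds) {
            typeCounts[id]++; // throws if a later batch contains a larger id
        }
    }

    /** Guarded version: grows the array before counting each batch. */
    void addBatchGrowing(int[] typeIds) {
        int needed = maxId(typeIds) + 1;
        if (needed > typeCounts.length) {
            typeCounts = Arrays.copyOf(typeCounts, needed); // keeps existing counts
        }
        for (int id : typeIds) {
            typeCounts[id]++;
        }
    }

    /** Returns the count for a type id, or 0 if the id was never seen. */
    int countOf(int id) {
        return id < typeCounts.length ? typeCounts[id] : 0;
    }

    private static int maxId(int[] ids) {
        int max = -1;
        for (int id : ids) {
            max = Math.max(max, id);
        }
        return max;
    }

    public static void main(String[] args) {
        int[] first = {0, 1, 2, 2};
        int[] second = {2, 30}; // id 30 never appeared in the first batch

        BatchCountsSketch naive = new BatchCountsSketch();
        naive.addBatchNaive(first);
        try {
            naive.addBatchNaive(second);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naive version failed: " + e.getMessage());
        }

        BatchCountsSketch growing = new BatchCountsSketch();
        growing.addBatchGrowing(first);
        growing.addBatchGrowing(second);
        System.out.println("growing version, count of type 2: " + growing.countOf(2));
    }
}
```

If something like this is what happens, the second call fails while the counts are being rebuilt, before any sampling starts, which would match the exception being raised from addInstances rather than from estimate.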

There have been similar questions on the MALLET list before: http://permalink.gmane.org/gmane.comp.ai.mallet.devel/924, http://permalink.gmane.org/gmane.comp.ai.mallet.devel/2139

So is incremental training of topic models possible?

vpekar

1 Answer


I think you took part in this conversation thread, which may be useful to you now:

http://comments.gmane.org/gmane.comp.ai.mallet.devel/2153
London guy