According to the MALLET documentation, it's possible to train topic models incrementally:
"--output-model [FILENAME] This option specifies a file to write a serialized MALLET topic trainer object. This type of output is appropriate for pausing and restarting training"
I'd like to train topics on one set of data and then increment the model with a different set of data. After both training steps, I'd like to output states for both datasets (with --output-state). Here is how I try to do it:
# training on the first dataset
../mallet-2.0.7/bin/mallet import-dir --input input/ --keep-sequence --output input.mallet
../mallet-2.0.7/bin/mallet train-topics --input input.mallet --num-topics 3 --output-state topic-state.gz --output-model model
# training on the second dataset
../mallet-2.0.7/bin/mallet import-dir --input input2/ --keep-sequence --output input2.mallet --use-pipe-from input.mallet
../mallet-2.0.7/bin/mallet train-topics --input input2.mallet --num-topics 3 --num-iterations 100 --output-state topic-state2.gz --input-model model
In the last command, if I add "--input-model model", the data from the second dataset is missing from the output-state file. If I leave it out, the data from the first dataset is missing instead.
If I instead try to add more instances to an already trained model in Java code:
model.addInstances(instances);
model.setNumThreads(2);
model.setNumIterations(50);
model.estimate();
[...]
model.addInstances(instances2);
model.setNumThreads(2);
model.setNumIterations(50);
model.estimate();
I get an error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 30
    at cc.mallet.topics.ParallelTopicModel.buildInitialTypeTopicCounts(ParallelTopicModel.java:364)
    at cc.mallet.topics.ParallelTopicModel.addInstances(ParallelTopicModel.java:276)
    at cc.mallet.examples.TopicModel2.main(TopicModel2.java:66)
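To show the kind of failure I suspect, here is a toy sketch that does not use MALLET at all. It rests on an assumption I have not confirmed in the MALLET source: that per-type count arrays are sized from one set of instances and then indexed with type ids from another. The class ToyIncrementalCounts and all of its methods are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a topic model's bookkeeping. NOT MALLET code; every name is hypothetical.
class ToyIncrementalCounts {
    private final Map<String, Integer> vocabulary = new HashMap<>(); // word -> type id
    private final List<int[]> documents = new ArrayList<>();          // every document added so far
    private int[] typeTotals = new int[0];                             // one counter per type id

    // Converts words to type ids, growing the shared vocabulary when a new word appears.
    int[] encode(String... words) {
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer id = vocabulary.get(words[i]);
            if (id == null) {
                id = vocabulary.size();
                vocabulary.put(words[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    // Naive update: the count array is allocated only for the first document
    // and never resized, so a type id introduced later falls outside it.
    void addNaive(int[] doc) {
        if (documents.isEmpty()) {
            typeTotals = new int[vocabulary.size()];
        }
        documents.add(doc);
        for (int type : doc) {
            typeTotals[type]++; // throws ArrayIndexOutOfBoundsException for unseen type ids
        }
    }

    // Resizing update: reallocate to the current vocabulary size and
    // recount every stored document, old and new.
    void addResizing(int[] doc) {
        documents.add(doc);
        typeTotals = new int[vocabulary.size()];
        for (int[] stored : documents) {
            for (int type : stored) {
                typeTotals[type]++;
            }
        }
    }

    // Number of occurrences of a known word across all counted documents.
    int total(String word) {
        return typeTotals[vocabulary.get(word)];
    }

    public static void main(String[] args) {
        ToyIncrementalCounts naive = new ToyIncrementalCounts();
        naive.addNaive(naive.encode("topic", "model"));
        try {
            naive.addNaive(naive.encode("model", "sampler")); // "sampler" is a new type
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naive update failed on a new word");
        }

        ToyIncrementalCounts resizing = new ToyIncrementalCounts();
        resizing.addResizing(resizing.encode("topic", "model"));
        resizing.addResizing(resizing.encode("model", "sampler"));
        System.out.println(resizing.total("model")); // 2
    }
}
```

If something like this is what happens inside addInstances, then the counts would have to be rebuilt over the old and new documents together, which is what addResizing does in the toy version. I do not know whether MALLET offers a supported way to do that.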
There have been similar questions on the MALLET list before: http://permalink.gmane.org/gmane.comp.ai.mallet.devel/924, http://permalink.gmane.org/gmane.comp.ai.mallet.devel/2139
So is incremental training of topic models possible?