0

I've used the following command to generate a topic model from some documents:

bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz

I have not, however, used the --output-model option to generate a serialized topic trainer object. Is there any way I can use the state file to infer topics for new documents? Training is slow, and it'll take a couple of days for me to retrain, if I have to create the serialized model from scratch.

Mountain
sandesh247

3 Answers

1

We did not use the command-line tools shipped with MALLET; we use the MALLET API directly to create the serialized model and run inference on new documents. Two points need special attention:

  • Serialize the pipes you used right after training finishes (in my case, SerialPipes).
  • Serialize the model itself once training finishes as well (in my case, ParallelTopicModel).
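
As a rough illustration of the two points above, the helpers below write any java.io.Serializable object to a file and read it back using standard Java serialization. This is only a sketch: the names SerializationSketch, save, and load are hypothetical, and the list in main is a stand-in for the real objects. It assumes the pipe and model classes implement Serializable, which should be checked against the Javadoc of the MALLET version in use.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical helper class, not part of MALLET.
final class SerializationSketch {

    // Write one serializable object (e.g. the pipe or the trained model) to a file.
    static void save(Serializable object, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(object);
        }
    }

    // Read the object back; the caller chooses the expected type.
    @SuppressWarnings("unchecked")
    static <T> T load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("pipe-stand-in", ".ser");
        file.deleteOnExit();

        // Stand-in for the real pipe or model object (assumption for this demo).
        ArrayList<String> standIn = new ArrayList<>(Arrays.asList("tokenize", "lowercase"));
        save(standIn, file);

        ArrayList<String> restored = load(file);
        System.out.println(restored.equals(standIn)); // prints "true"
    }
}
```

With MALLET on the classpath, the same helpers would presumably be called once for the pipes and once for the model right after training, and load would be called for both before running inference on new documents.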

Please check the Javadoc:

Mountain
0

If you mean you want to see how new documents fit into a previously trained topic model, then I'm afraid there is no simple command that does it correctly. The class cc.mallet.topics.LDA in the MALLET 2.0.7 source code provides such a utility; try to understand it and use it in your program. P.S. If my memory serves, there is a problem with the implementation of the following method in that class:

public void addDocuments(InstanceList additionalDocuments, 
                         int numIterations, int showTopicsInterval,
                         int outputModelInterval, String outputModelFilename,
                         Randoms r)

You have to rewrite it.

Shockley
0

According to the release notes, restoring a model from the state file appears to be a new feature in MALLET 2.0.7:

Ability to restore models from gzipped "state" files. From the new TopicTrainer, use the --input-state [filename] argument. Note that you can manually edit this file. Any token with topic set to -1 will be immediately resampled upon loading.
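
To make the note about manual editing concrete, here is a minimal sketch that marks every token of one document for resampling by setting its topic to -1. It assumes the decompressed state file has header lines starting with "#" followed by one token per line with six space-separated fields (doc, source, pos, typeindex, type, topic), and that no field contains a space; verify this against your own file before relying on it. The class and method names are hypothetical, and reading and writing the gzipped file (for example with GZIPInputStream and GZIPOutputStream) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of MALLET.
final class StateFileSketch {

    // Returns a copy of the state lines in which every token belonging to
    // document docIndex has its topic (assumed to be the sixth field) set to -1.
    static List<String> markForResampling(List<String> lines, int docIndex) {
        List<String> result = new ArrayList<>(lines.size());
        for (String line : lines) {
            // Header and blank lines are copied through unchanged.
            if (line.isEmpty() || line.startsWith("#")) {
                result.add(line);
                continue;
            }
            String[] fields = line.split(" ");
            if (fields.length == 6 && Integer.parseInt(fields[0]) == docIndex) {
                fields[5] = "-1"; // assumed layout: doc source pos typeindex type topic
                result.add(String.join(" ", fields));
            } else {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed format.
        List<String> state = Arrays.asList(
                "#doc source pos typeindex type topic",
                "0 NA 0 5 cat 3",
                "1 NA 0 7 dog 4",
                "1 NA 1 5 cat 3");
        for (String line : markForResampling(state, 1)) {
            System.out.println(line);
        }
    }
}
```

The edited lines would then be gzipped again and passed to the trainer with --input-state, as described in the release note.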

John Lehmann