I am trying to use MALLET to first train a topic model and then use the inferencer from that model on a set of new documents.
From two other threads here and on the MALLET mailing list, I've gathered that it is important to ensure compatibility of training and test data by using the --use-pipe-from [MALLET TRAINING FILE] option.
However, I prune the training data after importing it and fit the model on the pruned data.
/bin/mallet import-dir --input Train --output Train.mallet --keep-sequence --remove-stopwords
/bin/mallet prune --input Train.mallet --output Train_pruned.mallet --prune-document-freq 10
/bin/mallet train-topics --input Train_pruned.mallet --inferencer-filename inferencer --output-doc-topics doc-topics.txt
Now when trying to import the test data, the command
/bin/mallet import-dir --input Test --use-pipe-from Train_pruned.mallet
leads to an error saying "the alphabets don't match."
Alphabets don't match: Instance: [null, null], InstanceList: [50024, 1]
The same command with --use-pipe-from Train.mallet instead of Train_pruned.mallet seems to run. I am unsure, though, whether that actually yields "compatible" data, as the inferred topic proportions for the test documents look a bit odd to me.
Edit: the full commands are:
(1) Import the training data ✔
c:/mallet/bin/mallet import-dir --input "./data/Train" --output "./models/Train.mallet" --keep-sequence --remove-stopwords --extra-stopwords "./my_stoplist.txt"
(2) Prune the training data ✔
c:/mallet/bin/mallet prune --input "./models/Train.mallet" --output "./models/Train_pruned.mallet" --prune-document-freq 10 --min-idf 0.1
(3) Train the model: ✔
c:/mallet/bin/mallet train-topics --input "./models/Train_pruned.mallet" --optimize-interval 20 --num-topics 25 --output-topic-keys "./models/keys.txt" --output-doc-topics "./models/compostion.txt" --inferencer-filename "./models/inferencer.inferencer" --random-seed 1234
(4) Import the test data ✗
c:/mallet/bin/mallet import-dir --input "./data/Test" --output "./models/Test.mallet" --keep-sequence --remove-stopwords --extra-stopwords "./my_stoplist.txt" --use-pipe-from "./models/Train_pruned.mallet"
Where (4) leads to the following error message:
Exception in thread "main" java.lang.IllegalArgumentException: Alphabets don't match: Instance: [null, null], InstanceList: [50024, 1]
    at cc.mallet.types.InstanceList.add(InstanceList.java:335)
    at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
    at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:322)
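For completeness, the step that would follow (4) is the inference itself, which is not shown above. A minimal sketch of it, assuming MALLET's infer-topics command with its --input, --inferencer and --output-doc-topics options; the variable names MALLET and INFERRED, the output file name, and the RUN switch are hypothetical and only there for illustration. By default the script just prints the command so it can be checked before running:

```shell
#!/bin/sh
# Sketch of the inference step that would follow step (4).
# Input and inferencer paths mirror the commands above.
# MALLET, INFERRED and RUN are hypothetical names, not MALLET options.
MALLET="${MALLET:-c:/mallet/bin/mallet}"
INFERRED="./models/test-doc-topics.txt"   # hypothetical output file

# Build the command as a string so it can be inspected before running.
infer_cmd() {
  printf '%s infer-topics --input %s --inferencer %s --output-doc-topics %s\n' \
    "$MALLET" "./models/Test.mallet" "./models/inferencer.inferencer" "$INFERRED"
}

# Print the command by default; set RUN=1 to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
  eval "$(infer_cmd)"
else
  infer_cmd
fi
```

Running the script with RUN=1 would execute the printed command; it only makes sense once step (4) has produced a Test.mallet file that is compatible with the trained model.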