I am trying to use MALLET to first train a topic model and then use the inferencer from that model on a set of new documents. From two other threads here and on the MALLET mailing list, I've gathered that it is important to ensure compatibility of training and test data by using the --use-pipe-from [MALLET TRAINING FILE] option. However, I prune the training data after importing it and fit the model on the pruned data.

/bin/mallet import-dir --input Train --output Train.mallet --keep-sequence --remove-stopwords
/bin/mallet prune --input Train.mallet --output Train_pruned.mallet --prune-document-freq 10
/bin/mallet train-topics  --input Train_pruned.mallet --inferencer-filename inferencer --output-doc-topics doc-topics.txt

Now when trying to import the test data, the command

/bin/mallet import-dir --input Test --use-pipe-from Train_pruned.mallet

leads to an error saying "the alphabets don't match."

mallet Alphabets don't match: Instance: [null, null], InstanceList: [50024, 1]

The same command with --use-pipe-from Train.mallet instead of Train_pruned.mallet seems to run. I am unsure, though, if that actually leads to "compatible" data, as the inferred topic proportions for the test documents seem a bit weird to me.

Edit: the full commands are:

(1) Import the training data ✔

c:/mallet/bin/mallet import-dir --input "./data/Train" --output "./models/Train.mallet" --keep-sequence --remove-stopwords --extra-stopwords "./my_stoplist.txt"

(2) Prune the training data ✔

c:/mallet/bin/mallet prune --input "./models/Train.mallet" --output "./models/Train_pruned.mallet" --prune-document-freq 10 --min-idf 0.1

(3) Train the model: ✔

c:/mallet/bin/mallet train-topics --input "./models/Train_pruned.mallet" --optimize-interval 20 --num-topics 25 --output-topic-keys "./models/keys.txt" --output-doc-topics "./models/compostion.txt" --inferencer-filename "./models/inferencer.inferencer" --random-seed 1234

(4) Import the test data ✗

c:/mallet/bin/mallet import-dir --input "./data/Test" --output "./models/Test.mallet" --keep-sequence --remove-stopwords --extra-stopwords "./my_stoplist.txt" --use-pipe-from "./models/Train_pruned.mallet"

Where (4) leads to the following error message:

Exception in thread "main" java.lang.IllegalArgumentException: Alphabets don't match: Instance: [null, null], InstanceList: [50024, 1]

    at cc.mallet.types.InstanceList.add(InstanceList.java:335)
    at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
    at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:322)
  • Pruning shouldn't make any difference. What command specifically is causing the Alphabets don't match error? (It can't be the import, since there's nothing to match to) – David Mimno Sep 15 '19 at 21:35
  • Thank you for your reply, David! It actually IS the import command, though. At least when using the `--use-pipe-from` option with the pruned data. Or am I making a mistake elsewhere? – I_love_Norway Sep 16 '19 at 10:43
