I want to train a topic model in Mallet and need the topic-docs output, which lists the most prominent documents for each topic. I need it to interpret the less clear topics correctly. But Mallet keeps giving me empty txt files.
This is the command I use:
bin\mallet train-topics --input cleandata1000.mallet --num-topics 250 --num-iterations 3000 --optimize-interval 50 --optimize-burn-in 50 --output-topic-keys 1000-300-3000-50-topic-keys.txt --output-topic-docs 1000-300-1000-50-topic-docs.txt --num-top-docs 20 --output-doc-topics 1000-300-1000-50-doc-topics.txt --doc-topics-threshold 0.01 --xml-topic-phrase-report 1000-300-1000-50-topic-phrase.xml --output-state 1000-300-1000-50-state.gz --use-symmetric-alpha true
Does anyone know what the cause could be?
Edit in response to David Mimno's 4 Nov comment:
The same thing happens with different data (where the docs have a different length).
I just ran some other models with Mallet's test data. Oddly, this run produced no output at all (so "en-topic-docs.txt" was not even created):
bin\mallet train-topics --input en.mallet --num-topics 5 --output-topic-docs en-topic-docs.txt
When I also ask for the topic keys as output, both files are created, but en-topic-docs.txt is empty:
bin\mallet train-topics --input en.mallet --num-topics 5 --output-topic-keys en-topic-keys.txt --output-topic-docs en-topic-docs.txt
My bad: there is a recurring error message:
Exception in thread "main" java.lang.ClassCastException: class java.net.URI cannot be cast to class java.lang.String (java.net.URI and java.lang.String are in module java.base of loader 'bootstrap')
    at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1773)
    at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:281)
I don't know what this might mean.
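Read literally, the message says that some code tried to treat a java.net.URI object as a String. Here is a minimal Java sketch of that kind of failure. It is not Mallet's actual code: the class and method names are hypothetical, and it is only an assumption that the document names in my .mallet files are stored as URIs rather than Strings.

```java
import java.net.URI;

// Hypothetical demo class; not part of Mallet.
class UriCastDemo {

    // Casts a generically typed name to String.
    // Throws ClassCastException if the object is actually a URI.
    static String castName(Object name) {
        return (String) name;
    }

    // Converts the name to text instead of casting it.
    // Works for a URI, a String, or null.
    static String safeName(Object name) {
        return String.valueOf(name);
    }

    public static void main(String[] args) {
        // Assumed shape of a document name: a file URI (made-up path).
        Object name = URI.create("file:/data/doc1.txt");

        try {
            castName(name);
        } catch (ClassCastException e) {
            // Same kind of exception as in the Mallet stack trace.
            System.out.println("cast failed: " + e.getMessage());
        }

        // Converting instead of casting does not throw.
        System.out.println("converted: " + safeName(name));
    }
}
```

If that assumption holds, the exception would be thrown while the topic-docs file is being written, which might be why the file exists but stays empty.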
Thank you for any help, you are saving my PhD :)