Regarding the model output options in MALLET:

--output-model [FILENAME]
--output-state [FILENAME] 
--output-doc-topics [FILENAME] 
--output-topic-keys [FILENAME]

Is there a specification for these output files (which column corresponds to which value) that goes beyond this general description?

Michael Dorner

2 Answers

The output of these two options

--output-doc-topics [FILENAME] 
--output-topic-keys [FILENAME]

is a plain-text file with tab-separated values. Both files are easy to read by inspection. One slightly unusual detail: in the doc-topics file, the topics of each document are sorted by strength, so each value has to be accompanied by its topic number.
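
As a minimal sketch of reading such a doc-topics line in Python: the column layout assumed here (document index, document name, then alternating topic number and strength) is inferred from the description above and may differ between MALLET versions. The function name and the sample line are made up for illustration.

```python
def parse_doc_topics_line(line):
    """Parse one tab-separated doc-topics line.

    Assumed layout (not an official specification):
    doc index, doc name, then (topic number, strength) pairs.
    """
    fields = line.rstrip("\n").split("\t")
    doc_index, doc_name = int(fields[0]), fields[1]
    # Drop empty fields that a trailing tab would produce.
    rest = [f for f in fields[2:] if f != ""]
    # Pair each topic number with the strength that follows it.
    pairs = [(int(rest[i]), float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return doc_index, doc_name, pairs


# Hypothetical sample line, topics sorted by strength rather than by number.
sample = "0\tfile:/docs/a.txt\t3\t0.62\t0\t0.25\t1\t0.13\n"
doc_index, doc_name, pairs = parse_doc_topics_line(sample)

# Because the order is by strength, look values up by topic number.
by_topic = dict(pairs)
```

Converting the pairs to a dictionary keyed by topic number avoids relying on column position, which is not stable when the topics are sorted per document.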

The other two files

--output-model [FILENAME]
--output-state [FILENAME]

are "Java serialization data, version 5" (according to the UNIX file command); I am not aware of any more detailed documentation of their contents.
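
The identification reported by the file command can be reproduced by hand: a Java object serialization stream starts with the two magic bytes 0xAC 0xED followed by a two-byte big-endian version number, which is 5 here. The following is only a sketch of that check; the helper names are hypothetical, and it says nothing about the objects stored after the header.

```python
import struct

JAVA_STREAM_MAGIC = 0xACED  # first two bytes of a Java serialization stream


def java_serialization_version(header):
    """Return the stream version if the bytes start like Java serialization, else None."""
    if len(header) < 4:
        return None
    # Two unsigned 16-bit big-endian integers: magic number, then version.
    magic, version = struct.unpack(">HH", header[:4])
    return version if magic == JAVA_STREAM_MAGIC else None


def sniff_file(path):
    """Hypothetical helper: inspect the first four bytes of a file on disk."""
    with open(path, "rb") as fh:
        return java_serialization_version(fh.read(4))
```

Calling sniff_file on a file written with one of these options would be a quick way to confirm which of them really contain serialized Java objects.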

Michael Dorner
Sir Cornflakes

Please edit if you find something useful!

--output-topic-keys: The first column is the topic ID number, corresponding to the order in which each label first appeared in the training data. The second column is the label string. The third column is the total number of tokens assigned to that topic in the Gibbs sampling state at which sampling stopped. The last column is a space-delimited list of 20 words, in descending order of probability in the topic.
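
A short Python sketch of reading one line under the four-column layout described above. It assumes the columns are separated by tabs and that the token count is an integer; the class name, function name, and sample line are invented for illustration and are not part of MALLET.

```python
from typing import List, NamedTuple


class TopicKeyRow(NamedTuple):
    topic_id: int
    label: str
    token_count: int
    top_words: List[str]


def parse_topic_keys_line(line):
    """Split a line into topic ID, label, token count, and the word list."""
    # Split on the first three tabs only; the last column holds the words.
    topic_id, label, token_count, words = line.rstrip("\n").split("\t", 3)
    return TopicKeyRow(int(topic_id), label, int(token_count), words.split())


# Hypothetical sample line (shortened to five words instead of 20).
sample_row = parse_topic_keys_line("0\tsports\t1523\tgame team season player coach\n")
```

If a file has only three columns, the unpacking raises a ValueError, which is a useful signal that the layout differs from the one assumed here.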

Michael Dorner