
In Mallet, we can get a diagnostics file that includes a coherence measure for each topic (http://mallet.cs.umass.edu/diagnostics.php). In Gensim, we get an overall score for a set of topics as well as a score for each individual topic (https://radimrehurek.com/gensim/models/coherencemodel.html). I have two questions:

1- What is the name of the coherence method used in Mallet's diagnostics file?

2- If we want an overall score from the per-topic coherence scores in Mallet's diagnostics file, can we just take their average?

Panda

1 Answer

  1. I've seen it called the UMass method; I don't know if there's a standard nomenclature. See Röder et al. for a general treatment. The important variables are whether the reference corpus is the same as the training corpus (yes, so think of it as an upper bound), whether the statistic is probability or document frequency (df), the form of the equation (conditional probability, not PMI), and the smoothing factor (small, so words that never co-occur make a big difference).

  2. You can, but the distribution matters. A few really rotten topics might destroy user confidence more than a small difference in many topics. The main thing to worry about with these co-occurrence-based coherence metrics is redundancy. It's trivial to create a large number of identical topics with high-frequency words that occur often together.
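Both points can be sketched concretely. The following is a minimal sketch, not Mallet's or Gensim's actual implementation: a UMass-style score built from document frequencies, conditional probability, and a small smoothing term, plus a summary that reports the minimum alongside the mean so a few rotten topics stay visible. The function names, the smoothing default, and the toy corpus in the test are my own choices.

```python
import math
from itertools import combinations
from statistics import mean

def umass_coherence(top_words, docs, smoothing=1.0):
    """UMass-style coherence (a sketch, not Mallet's exact code).

    Sums log((D(w_m, w_l) + smoothing) / D(w_l)) over ordered pairs of
    top words, where D(.) is a document-frequency count in the
    reference corpus (here, the same corpus the model was trained on).
    """
    doc_sets = [set(d) for d in docs]

    def df(words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):
        w_l, w_m = top_words[l], top_words[m]
        score += math.log((df([w_m, w_l]) + smoothing) / df([w_l]))
    return score

def coherence_summary(scores):
    # Averaging hides the tail: report the minimum alongside the mean
    # so a few very bad topics don't disappear into the average.
    return {"mean": mean(scores), "min": min(scores)}
```

For example, with docs = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"]], the pair ("cat", "dog") co-occurs in 2 of the 3 documents containing "cat", so with smoothing of 1 the score is log((2 + 1) / 3) = 0, while ("cat", "fish") scores log((1 + 1) / 3), which is negative.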

David Mimno
  • As a side note on his second point, a good way to ensure that you're not getting redundant topics is to compute the topic diversity of a topic set, as in Dieng et al. http://www.cs.columbia.edu/~blei/papers/DiengRuizBlei2020a.pdf – rchurch4 Feb 24 '21 at 22:55
  • David, do you know why I get the highest average coherence when the number of topics is 2? In general, a number of topics below 10 always gets the highest average coherence. I tried this on different datasets with millions of documents. – Panda Feb 26 '21 at 16:27
  • This is another pathological case for coherence, similar to redundant topics. Think about the top words for a degenerate one-topic model: it's just the corpus frequencies. High frequency words are likely to occur together, so coherence is great. – David Mimno Mar 04 '21 at 00:57
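The topic diversity measure mentioned in the comment thread is simple to sketch: the fraction of unique words across all topics' top-word lists. This is a minimal interpretation of the metric from the Dieng et al. paper linked above; the function name and example topics are my own.

```python
def topic_diversity(topics):
    """Fraction of unique words across the topics' top-word lists.

    1.0 means no topic shares a top word with another; values near 0
    mean the topics are largely redundant copies of each other.
    """
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

# Redundant topics built from high-frequency words share most top words:
redundant = [["the", "of", "and"], ["the", "of", "to"]]
# Distinct topics share none:
distinct = [["gene", "dna", "cell"], ["market", "stock", "price"]]
```

Here topic_diversity(redundant) is 4/6, since only 4 of the 6 listed words are unique, while topic_diversity(distinct) is 1.0.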