I'm working with a decently sized data set and want to identify how many topics make sense. I used both NMF and LDA (the sklearn implementations), but the key question is: what is a suitable measure of success? Visually, many of my topics have only a few high-weight keywords (the other weights are ~0), and a few topics have a more bell-shaped distribution of weights. What is the target: a topic with a few high-weight words and the rest near zero (a spike), which is mostly what NMF gives me, or a bell shape with a gradual reduction of weights over a large number of keywords, which is mostly what LDA gives (not a literal curve, obviously)?
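For concreteness, here is roughly how I compare the two as the number of topics varies: a minimal sketch only, where the 20-newsgroups corpus, the vectorizer settings, and the "top-10 mass" concentration measure are placeholders I picked for illustration, not my actual pipeline.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Stand-in corpus; substitute your own documents here.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

tfidf = TfidfVectorizer(max_df=0.95, min_df=5, stop_words="english").fit_transform(docs)
counts = CountVectorizer(max_df=0.95, min_df=5, stop_words="english").fit_transform(docs)

def top10_mass(components):
    """Row-normalise each topic's word weights and report the average
    probability mass held by its 10 heaviest words: close to 1 for a
    spike, much lower for a bell-shaped, spread-out topic."""
    probs = components / components.sum(axis=1, keepdims=True)
    return np.sort(probs, axis=1)[:, -10:].sum(axis=1).mean()

for k in (5, 10, 20, 40):
    # NMF on TF-IDF, LDA on raw counts, as is commonly done.
    nmf = NMF(n_components=k, init="nndsvd", max_iter=400).fit(tfidf)
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    print(f"k={k:3d}  "
          f"NMF recon. err={nmf.reconstruction_err_:.2f}, "
          f"top-10 mass={top10_mass(nmf.components_):.3f}  |  "
          f"LDA perplexity={lda.perplexity(counts):.1f}, "
          f"top-10 mass={top10_mass(lda.components_):.3f}")
```

The built-in numbers (NMF reconstruction error, LDA perplexity) are what sklearn exposes directly, but they are not obviously the right measure of "success" either, which is exactly my question.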
I also use a weighted Jaccard (a weighted version of the keyword-set overlap) to compare topics; there are no doubt better methods, but this one is kind of intuitive.
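A minimal sketch of the weighted Jaccard I have in mind (element-wise minima over element-wise maxima on row-normalised topic-word weights); the helper names are just for illustration, and it assumes both weight matrices are over the same vocabulary:

```python
import numpy as np

def weighted_jaccard(x, y):
    """Weighted Jaccard similarity of two nonnegative weight vectors:
    sum of element-wise minima over sum of element-wise maxima."""
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def topic_overlap(components_a, components_b):
    """Pairwise weighted Jaccard between the (row-normalised) topics of two
    fitted models over the same vocabulary, e.g. nmf.components_ vs.
    lda.components_, or two runs with different seeds to check stability."""
    a = components_a / components_a.sum(axis=1, keepdims=True)
    b = components_b / components_b.sum(axis=1, keepdims=True)
    return np.array([[weighted_jaccard(ai, bj) for bj in b] for ai in a])
```

Restricting each topic to its top-N keywords before computing this gives something closer to a plain (weighted) set overlap, which is how I tend to read the results.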
Your thoughts on this?
best,
Andreas