What is the best model for topic spotting within short unstructured documents, e.g. SMS or Twitter messages? Latent Dirichlet allocation?
3 Answers
LDA is one of the strongest models available for topic modeling, but applying it to very short texts such as Twitter/microblog posts might require some extra work. The authors of this paper discuss LDA and an alternative model, and recommend aggregating multiple posts before running a topic model on them.
[Watch out with terminology: "topic spotting" is actually an old synonym for supervised document classification.]

Applying topic models such as LDA to short texts (e.g. tweets) is more challenging because of data sparsity and the limited context in such texts. One approach is to combine short texts into long pseudo-documents before training LDA. Another simple approach is to assume that there is only one topic per document.
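The pseudo-document idea can be sketched in a few lines of Python. This is only an illustration: pooling by author is an assumption (a hashtag or a time window would be other reasonable keys), and the `posts` records and the `pool_by_key` helper are made up for the example.

```python
from collections import defaultdict


def pool_by_key(posts, key):
    """Concatenate the tokens of all posts that share the same key value."""
    pooled = defaultdict(list)
    for post in posts:
        # Naive whitespace tokenization; a real pipeline would clean the text first.
        pooled[post[key]].extend(post["text"].lower().split())
    return dict(pooled)


# Hypothetical sample data, not taken from any real corpus.
posts = [
    {"author": "alice", "text": "Battery died again"},
    {"author": "bob", "text": "Great goal tonight"},
    {"author": "alice", "text": "New phone arrived"},
]

# One longer pseudo-document per author, ready to feed to a topic model.
pseudo_docs = pool_by_key(posts, "author")
```

Each value in `pseudo_docs` is then treated as a single document when training LDA.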
The one-topic-per-document Dirichlet Multinomial Mixture (DMM) model (mixture of unigrams) tends to work better than LDA for modeling topics in short texts such as tweets. You can find implementations of both the LDA and DMM models in the jLDADMM package. jLDADMM also provides a document clustering evaluation to compare these topic models.
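To make the one-topic-per-document assumption concrete, here is a rough sketch of a collapsed Gibbs sampler for DMM in plain Python. It is not jLDADMM's code; the function name `dmm_gibbs`, the hyperparameter defaults and the toy documents are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def dmm_gibbs(docs, num_topics, alpha=0.1, beta=0.1, iterations=30, seed=0):
    """Assign exactly one topic to each tokenized document via collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]

    docs_in_topic = [0] * num_topics                      # documents per topic
    words_in_topic = [0] * num_topics                     # tokens per topic
    word_topic = [Counter() for _ in range(num_topics)]   # per-topic word counts

    def move(d, z, sign):
        """Add (sign=+1) or remove (sign=-1) document d from topic z."""
        docs_in_topic[z] += sign
        words_in_topic[z] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            word_topic[z][w] += sign * c

    # Random initial assignment.
    assignment = []
    for d in range(len(docs)):
        z = rng.randrange(num_topics)
        assignment.append(z)
        move(d, z, +1)

    for _ in range(iterations):
        for d in range(len(docs)):
            move(d, assignment[d], -1)
            log_p = []
            for z in range(num_topics):
                # Topic popularity term (the denominator is constant across topics).
                lp = math.log(docs_in_topic[z] + alpha)
                # Likelihood of the whole document under topic z.
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(word_topic[z][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(words_in_topic[z] + vocab_size * beta + i)
                log_p.append(lp)
            # Sample a new topic from the normalized conditional distribution.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            z = rng.choices(range(num_topics), weights=weights)[0]
            assignment[d] = z
            move(d, z, +1)
    return assignment


# Hypothetical toy corpus.
docs = [
    "phone battery died".split(),
    "new phone battery".split(),
    "great goal tonight".split(),
    "late goal wins match".split(),
]
topics = dmm_gibbs(docs, num_topics=2)
```

Unlike LDA, the sampler draws a single topic for the whole document rather than one per token, which is what makes the model less sensitive to very short texts.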

I think it all depends on the data. So you should also try plain TF-IDF, LSI, LDA, k-means, and hierarchical clustering to detect useful phrases and topics.
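As a starting point for the simplest of these baselines, a TF-IDF weighting with cosine similarity could look roughly like this. It is a bare-bones sketch using only the standard library; the helper names `tfidf` and `cosine` and the sample documents are invented for the example, and a real experiment would more likely use an existing library implementation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical sample messages.
docs = [
    "new phone battery lasts long".split(),
    "phone battery died again".split(),
    "great goal in the match tonight".split(),
]
vectors = tfidf(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

The resulting vectors or pairwise similarities can then be passed to k-means or hierarchical clustering to see whether sensible groups emerge on your data.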
