
The Dirichlet distribution is used in document modelling.

I read from this article that:

Different Dirichlet distributions can be used to model documents by different authors or documents on different topics.
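For concreteness, "different Dirichlet distributions" just means different parameter vectors: a small concentration parameter produces peaky mixtures (most mass on one component), a large one produces near-uniform mixtures. A minimal numpy sketch, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# A Dirichlet over K=3 components. Small alpha -> sparse (peaky) mixtures,
# large alpha -> near-uniform mixtures.
sparse = rng.dirichlet([0.1, 0.1, 0.1], size=5)
smooth = rng.dirichlet([10.0, 10.0, 10.0], size=5)

# Every draw is a valid probability vector: non-negative, sums to 1.
print(np.allclose(sparse.sum(axis=1), 1.0))  # True
print(np.allclose(smooth.sum(axis=1), 1.0))  # True

# Sparse draws concentrate mass on one component; smooth draws spread it out.
print(sparse.max(axis=1).mean(), smooth.max(axis=1).mean())
```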

So how can we tell whether it is modelling different authors or different topics? This matters because in a document clustering task it directly dictates the semantics of the clustering result.

I also find it too subjective to limit the possible aspects of modelling to only author or topic. Since there seems to be no strong evidence favouring a specific aspect, it could be any other potential/latent aspect.

Could anyone shed some light on this?

smwikipedia

2 Answers


It is not modeling authors or topics at all, but latent features, which might well map to real-world concepts like author or topic. For any latent feature, you can see which documents are most strongly associated, and maybe develop an intuitive interpretation of what the feature is "about".

Sean Owen
  • Appreciate your reply. I am trying to use the LDA (Latent Dirichlet Allocation) algorithm to cluster documents into topics. So how can we steer the clustering criterion toward topics? – smwikipedia Feb 22 '14 at 12:53
  • 1
    I suppose that if you knew the topics ahead of time, and knew some words that were definitely associated to the topics, you could use those as the initial starting point of the algorithm rather than drawing from a distribution. The algorithm would then adapt your starting definition of topics to learn from the data, which may be what you intend. Of course if you knew already exactly which topic was associated to every word then LDA is not necessary; it's one trivial step to assign each document to a topic. – Sean Owen Feb 22 '14 at 14:12
  • This seems to be a chicken-and-egg dilemma. If I knew the number of topics and the topic words ahead of time, it would just be a classification problem. But my current problem is that I don't know the number of topics, and I don't know the topic **granularity** either, so I cannot decide which words are topic-associated, because a word could be too broad or too narrow for a topic. Could there be any possible solution to this? – smwikipedia Feb 22 '14 at 14:51
  • A vague hint will also be appreciated. – smwikipedia Feb 22 '14 at 15:18
  • Yes, it's an unsupervised problem. You can't know the right number of topics ahead of time, not least because it's not clear what "right" means. It may be that just picking a number -- 20 -- gives results that are meaningful. Or for some definition of "right", you can pick the number that seems to fit the data best. – Sean Owen Feb 22 '14 at 19:25
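Sean Owen's suggestion in the comments above — seeding the algorithm with known topic words instead of a random starting point — could be sketched as follows. The vocabulary, seed lists, and smoothing constant here are all hypothetical:

```python
import numpy as np

# Hypothetical vocabulary and seed words for each intended topic.
vocab = ["ball", "goal", "team", "vote", "party", "law", "stock", "bank", "trade"]
seeds = {0: ["ball", "goal", "team"],    # sports
         1: ["vote", "party", "law"],    # politics
         2: ["stock", "bank", "trade"]}  # finance

V, K = len(vocab), len(seeds)
word_id = {w: i for i, w in enumerate(vocab)}

# Start every topic from a small uniform pseudo-count, then boost its seed words.
beta = np.full((K, V), 0.01)
for k, words in seeds.items():
    for w in words:
        beta[k, word_id[w]] += 1.0

# Normalise rows so each topic is a distribution over the vocabulary.
beta /= beta.sum(axis=1, keepdims=True)

# An inference procedure (e.g. collapsed Gibbs sampling) would start from this
# beta instead of a symmetric random draw and adapt it to the data.
print(beta[0].argmax() == word_id["ball"])  # True: seed words dominate topic 0
```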

It sounds like you're making a common mistake when thinking about LDA.

LDA is not a document clustering method. Any attempt to assign a topic to a document is incorrect given the model; indeed, any attempt to assign topics to words is also not correct. Instead, LDA is a way of looking at collections of documents, and at the way topics are mixed within those documents.

To put it another way, each document does not have a single topic; it has a distribution over topics. This is not uncertainty as to which topic the document belongs to, but rather the proportion of topics used within that document. Given a document, you can compute the distribution over topic mixtures within that document; given a collection of documents, you can infer both the mixtures within each document and the topics that best describe the collection. Each word also has uncertainty as to which topic it comes from, since by definition each topic can emit every possible word, though emission is more probable from some topics than others.
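The generative story described above — a per-document mixture over topics, with each word drawn from whichever topic it was assigned — can be sketched in numpy. The topic count, vocabulary size, and hyperparameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

K, V, n_docs, doc_len = 3, 50, 4, 20  # topics, vocab size, docs, words per doc
alpha = np.full(K, 0.5)               # document-topic prior

# Each topic is a distribution over the WHOLE vocabulary: it can emit any word,
# just with different probabilities.
topics = rng.dirichlet(np.full(V, 0.1), size=K)

for d in range(n_docs):
    theta = rng.dirichlet(alpha)               # this document's topic mixture
    z = rng.choice(K, size=doc_len, p=theta)   # a topic assignment per word
    words = [rng.choice(V, p=topics[k]) for k in z]
    # theta is a proportion of topics used, not a single label for the document.
    print(f"doc {d}: mixture={np.round(theta, 2)}, words={words}")
```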

To answer your original question about whether the topics reflect author, topic, style, register, or whatever: the topics don't explicitly represent any of these. They represent groupings of words. Each topic is a distribution over the vocabulary, and so different topics represent different tendencies for word use: in a collection of homogeneous authorship but heterogeneous topic, these might correspond to an intuitive notion of "topic" (i.e. subject matter); in a collection of heterogeneous authors but homogeneous topic, perhaps different topics would correlate with different authors. In a collection of mixed topic, author, register, genre, etc. they may not correspond to any observable characteristic at all.
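Since each topic is just a distribution over the vocabulary, the usual way to interpret one is to list its highest-probability words and see whether an intuitive label suggests itself. A minimal sketch over a hypothetical fitted topic-word matrix:

```python
import numpy as np

vocab = np.array(["match", "score", "player", "election", "senate", "policy"])

# Hypothetical fitted topic-word probabilities (each row sums to 1).
beta = np.array([[0.40, 0.30, 0.25, 0.02, 0.02, 0.01],   # "sporty" word use
                 [0.02, 0.03, 0.02, 0.35, 0.28, 0.30]])  # "political" word use

def top_words(beta_k, n=3):
    """Return the n highest-probability words for one topic row."""
    return vocab[np.argsort(beta_k)[::-1][:n]].tolist()

print(top_words(beta[0]))  # ['match', 'score', 'player']
print(top_words(beta[1]))  # ['election', 'policy', 'senate']
```

Whether those word groupings correspond to subject matter, authorship, or anything nameable at all is exactly the point made above: the model does not say.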

Instead, the topics are an abstract construction, and all the final topics tell you is what the best topics are for allowing you to reconstruct the original input assuming the model is correct. The sad truth is that this might not correspond to what you want the topics to correspond to because the thing you're really interested in (authorship, say) covaries with other things you're not interested in (register, topic, genre) in the collection you provide. Unless you explicitly mark all the things that could be responsible for a shift in usage of vocabulary, as expressed in a bag of words model, and then devise a model which accounts for them all (not vanilla LDA for certain), you simply won't be able to guarantee correspondence between the topics induced and groupings on the dimension you care about.

Ben Allison
  • So the *topic* modeled with Dirichlet distribution is just *a distribution over words*. No more, no less. And literally. – smwikipedia Feb 27 '14 at 09:16
  • Then shame on the first guy who used the misleading word *topic* to sell Dirichlet distribution. It makes the idea SO appealing BUT actually in an inappropriate way – smwikipedia Feb 27 '14 at 10:14
  • To be fair to him, he does say they're latent features, not topics. It's an issue that researchers face, I suppose---do they use words understandable to non-practitioners to increase impact and acceptance in the broader field, or stick to highly technical but accurate descriptions? – Ben Allison Feb 27 '14 at 10:40