
I am a data mining beginner, and I am trying to formulate an approach to a clustering problem I am working on.

Suppose we have x writers, each with a particular style (use of unique words, etc.). Each writer produces multiple short texts, let's say haikus. We collect many hundreds of these haikus from the authors and try to work out from them, using context analysis, how many authors there were in the first place (we somehow lost the records of how many authors there were, after a great war!).

Let's assume I create a hash table of word counts for each of these haikus. Then I could write a distance function that looks at the overlap of repeated words between each pair of these vectors. That would allow me to implement some sort of k-means clustering.
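To make that concrete, here is a rough sketch of the kind of thing I have in mind; the word-count hash tables, the cosine-style distance, and the tiny example haikus are all just illustrative assumptions:

    from collections import Counter
    import math

    def bag_of_words(haiku):
        """Hash table (dict) mapping word -> count for one haiku."""
        return Counter(haiku.lower().split())

    def cosine_distance(a, b):
        """Distance driven by how many words two haikus share."""
        shared = set(a) & set(b)
        dot = sum(a[w] * b[w] for w in shared)
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 1.0
        return 1.0 - dot / (norm_a * norm_b)

    haikus = ["an old silent pond", "a frog jumps into the pond", "autumn moonlight"]
    vectors = [bag_of_words(h) for h in haikus]
    print(cosine_distance(vectors[0], vectors[1]))  # smaller -> more similar word use

(One caveat I am aware of: standard k-means assumes Euclidean distance on fixed-length vectors, so these counts would have to be embedded over a shared vocabulary first, e.g. with scikit-learn's CountVectorizer.)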

My problem now is to estimate, probabilistically, the number of clusters, i.e. the number of authors, that gives the optimal fit.

Something like:

number of authors | probability
------------------|------------
                1 | 0.05
                2 | 0.10
                3 | 0.20
                4 | 0.40
                5 | 0.10
                6 | 0.05
                7 | 0.03
                8 | 0.01

The only constraint here is that as the number of authors (or clusters) goes to infinity, the sum of the probabilities should converge to 1, I think.

Does anyone have any thoughts or suggestions on how to implement this second part?

Ryan
  • Are you sure the algorithm will **cluster by authors, and not by topics**? Or by languages? Or by gender? Chances are that the algorithm discovers some structure very different from authors. – Has QUIT--Anony-Mousse Aug 13 '14 at 14:31
  • Hi - that is a fair call. There is an implicit assumption here that the data is engineered to differ along one primary dimension. The clustering algorithm is not really the problem; finding the probability distribution that goes with it is what is of interest. – Ryan Aug 13 '14 at 15:42

1 Answer


Let's formulate an approach using Bayesian statistics.

  1. Pick a prior P(K) on the number of authors, K. For example, you might say K ~ Geometric(p) with support {1, 2, ... } where E[K] = 1 / p is the number of authors you expect there to be prior to seeing any writings.

  2. Pick a likelihood function L(D|K) that assigns a likelihood to the writing data D given a fixed number of authors K. For example, you might fit a K-component GMM by expectation-maximization and use the fitted model's likelihood of D (EM maximizes exactly this quantity, and a lower total fitting error corresponds to a higher likelihood). To be really thorough, you could learn L(D|K) from data: the internet is full of haikus with known authors.

  3. Find the value of K that maximizes the posterior probability P(K|D) - your best guess at the number of authors. Note that since P(K|D) = P(D|K)P(K)/P(D), P(D) is constant in K, and L(D|K) is proportional to P(D|K), you have:

    argmax { P(K|D) | K = 1, 2, ... } = argmax { L(D|K)P(K) | K = 1, 2, ... }

With respect to your question, the first column in your table corresponds to K and the second column corresponds to a normalized P(K|D); that is, it is proportional to L(D|K)P(K).
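Here is a minimal sketch of steps 1-3 in Python, assuming bag-of-words count vectors, a Geometric(p) prior, and scikit-learn's GaussianMixture as the likelihood model; all of these choices (including the k_max cutoff) are illustrative, not required by the approach:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.mixture import GaussianMixture

    def author_count_posterior(haikus, p=0.25, k_max=8):
        """Normalized posterior P(K|D) over K = 1..k_max."""
        # Bag-of-words count vectors for the haikus.
        X = CountVectorizer().fit_transform(haikus).toarray()
        log_post = []
        for k in range(1, k_max + 1):
            # Step 2: data likelihood under a K-component GMM fit by EM.
            gmm = GaussianMixture(n_components=k, covariance_type="diag",
                                  random_state=0).fit(X)
            log_lik = gmm.score(X) * X.shape[0]  # score() is the mean log-likelihood
            # Step 1: Geometric(p) prior, P(K = k) = (1 - p)^(k - 1) * p.
            log_prior = (k - 1) * np.log(1 - p) + np.log(p)
            log_post.append(log_lik + log_prior)
        # Step 3: P(D) cancels, so normalizing exp(log L(D|K)P(K)) gives P(K|D).
        log_post = np.array(log_post)
        post = np.exp(log_post - log_post.max())
        return post / post.sum()

    # posterior = author_count_posterior(corpus_of_haikus)
    # print(posterior.argmax() + 1)  # MAP estimate of the number of authors

Truncating at k_max and normalizing means the reported probabilities sum to 1 over the values of K you consider, which matches the constraint in the question.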

Timothy Shields