I am a data mining beginner and am trying to first formulate an approach to a clustering problem I am solving.
Suppose we have x writers, each with a particular style (use of unique words, etc.). Each writes multiple short texts, say haikus. We collect many hundreds of these haikus from the authors and, analysing their content, try to infer how many authors there were in the first place (we somehow lost the records of how many authors there were, after a great war!)
Let's assume I create a hash table of word counts for each of these haikus. I could then write a distance function that measures the overlap of words between each pair of vectors. That would let me implement some sort of k-means clustering.
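To make the first part concrete, here is a minimal sketch of that idea using only the standard library: each haiku becomes a word-count hash table (a `Counter`), and cosine distance over the shared vocabulary serves as the distance function a clustering routine could use. The haiku texts are made up for illustration.

```python
from collections import Counter
import math

def bag_of_words(text):
    # Hash table of word frequencies for one haiku
    return Counter(text.lower().split())

def cosine_distance(a, b):
    # 1 - cosine similarity between two word-count tables;
    # repetition of the same words across haikus lowers the distance
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

h1 = bag_of_words("an old silent pond a frog jumps into the pond")
h2 = bag_of_words("the old pond a frog leaps in sound of the water")
h3 = bag_of_words("winter solitude in a world of one color the sound of wind")

# h1 and h2 share many words (old, pond, frog), so they are closer
print(cosine_distance(h1, h2))  # smaller distance
print(cosine_distance(h1, h3))  # larger distance
```

In practice the distance matrix produced this way feeds straight into a k-means (or, since these are sparse count vectors, k-medoids) implementation.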
My problem now is to estimate, probabilistically, the number of clusters, i.e. the number of authors, that would give me the optimum fit.
Something like:
number of authors | probability
1 | 0.05
2 | 0.1
3 | 0.2
4 | 0.4
5 | 0.1
6 | 0.05
7 | 0.03
8 | 0.01
The only constraint here is that, as the allowed number of authors (or clusters) goes to infinity, the sum of the probabilities should converge to 1, I think.
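One common way to produce such a table is to fit a clustering model for each candidate k, score each fit (for example with a log-likelihood or a negative BIC from a mixture model), and then normalise the scores into a distribution with a softmax so they sum to 1. Below is a sketch of just the normalisation step; the score values are made up for illustration and would in practice come from your clustering runs.

```python
import math

# Hypothetical per-k fit scores (higher = better fit), e.g. negative BIC
# values from fitting a mixture model at each candidate author count.
scores = {1: -120.0, 2: -110.0, 3: -104.0, 4: -98.0, 5: -103.0, 6: -109.0}

def scores_to_probabilities(scores):
    # Softmax: exponentiate each score and normalise so the result sums to 1.
    # Subtracting the maximum first keeps math.exp from overflowing.
    m = max(scores.values())
    exp = {k: math.exp(s - m) for k, s in scores.items()}
    total = sum(exp.values())
    return {k: e / total for k, e in exp.items()}

probs = scores_to_probabilities(scores)
for k, p in sorted(probs.items()):
    print(k, round(p, 3))

print(sum(probs.values()))  # sums to 1 up to floating point
```

This gives exactly the shape of table above: a probability per candidate author count, peaked at the best-fitting k, with the total mass summing to 1 regardless of how many candidate values you score.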
Does anyone have any thoughts or suggestions on how to implement this second part?