
Say I have items i1, ..., iN.

I would like to cluster them in such a way that:

  1. If I ran the clustering many times, items iJ and iK that share a cluster in one run would, with high probability, share a cluster in the other runs too.
  2. The number of clusters and the cluster memberships are relatively stable regardless of the initial seeds.

Are there well known algorithms to achieve this?

Clarification:

Say I want 3 clusters, and:

  • in reality-1 I start with i1, i33, i89 as seeds for clusters c1, c2, c3
  • in reality-2 I start with i44, i55, i77 as seeds for clusters c1, c2, c3

I want the resulting clusters in both realities to be largely similar.

user1172468

2 Answers


I think that hierarchical clustering algorithms will meet your needs.

  1. Cluster consistency is guaranteed for the same data set: the algorithm is deterministic, so if items iJ and iK end up in the same cluster once, they do so on every run (probability 1).
  2. There is no seed. You choose the number of clusters by analysing the tree (dendrogram), or by using one of the many existing cut-off criteria.

[EDIT]

In fact any deterministic clustering algorithm has these features, not just hierarchical clustering.
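
To illustrate the "no seed" point, here is a minimal sketch in pure Python. It assumes single linkage, Euclidean distance, and a hypothetical toy data set `items` with distinct points; none of these choices come from the question, and a library routine would be the practical choice for real data. It checks that shuffling the input order does not change the resulting partition.

```python
import itertools
import math
import random


def single_linkage(points, n_clusters):
    """Naive agglomerative clustering with single linkage (no seeds).

    Returns the partition as a set of frozensets so that two results can
    be compared regardless of input order or cluster numbering.
    Assumes all points are distinct.
    """
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > n_clusters:
        # Pick the two clusters whose closest members are nearest.
        a, b = min(
            itertools.combinations(clusters, 2),
            key=lambda pair: min(math.dist(p, q) for p in pair[0] for q in pair[1]),
        )
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a | b)
    return set(clusters)


# Hypothetical toy data: three well-separated groups of 2-D items.
items = [
    (0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.3),
    (10.0, 0.0), (10.1, 0.3), (9.7, 0.2),
]

reference = single_linkage(items, 3)

# Re-run on 20 shuffled copies of the input; the partition should not change.
stable = all(
    single_linkage(random.Random(trial).sample(items, len(items)), 3) == reference
    for trial in range(20)
)
print(stable)
```

One caveat: if several pairs are exactly equidistant at the level where the tree is cut, the tie-breaking order can matter, so "deterministic" here means deterministic for a fixed input and tie-breaking rule.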

CTZStef
  • In fact **any** deterministic clustering algorithm has these features, not just hierarchical clustering, also k-means determinization techniques, etc. – lejlot Oct 11 '13 at 19:26
  • What I meant was: say I want 3 clusters, and I start with i1, i33, i89 as seeds for clusters c1, c2, c3, or with seeds i44, i55, i77. The resulting clusters in both cases should be largely similar. – user1172468 Oct 11 '13 at 19:29
  • @lejlot, true, hierarchical clustering was the first that came to my mind. I will update my answer to take your remark into account. @user1172468: there is no seed in h.c. – CTZStef Oct 11 '13 at 19:54
  • @CTZStef ... no seed in h.c. hmmmmm I need to dig up my notes again - lol – user1172468 Oct 11 '13 at 23:10

An often-seen strategy to make an algorithm more robust with respect to initialization is to bootstrap it. See for instance this paper.
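
As a rough sketch of the idea (not necessarily the method of the paper mentioned above), one can run a seeded algorithm many times and record how often each pair of items lands in the same cluster, then group the items that co-occur in most runs. The version below re-runs plain k-means with different random seeds rather than resampling the data, so it is in the spirit of bootstrapping rather than a true bootstrap. The helper names, the toy data `items`, the 50 runs, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means; the k seeds are random items drawn with rng."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def co_association(points, k, n_runs=50, seed=0):
    """Fraction of runs in which each pair of items shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                counts[i][j] += int(labels[i] == labels[j])
    return [[c / n_runs for c in row] for row in counts]


def consensus_clusters(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold."""
    unseen = set(range(len(coassoc)))
    clusters = []
    while unseen:
        stack = [unseen.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            linked = {j for j in unseen if coassoc[i][j] > threshold}
            unseen -= linked
            component |= linked
            stack.extend(linked)
        clusters.append(sorted(component))
    return sorted(clusters)


# Hypothetical toy data: three well-separated groups of 2-D items.
items = [
    (0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.3),
    (10.0, 0.0), (10.1, 0.3), (9.7, 0.2),
]

coassoc = co_association(items, k=3)
print(consensus_clusters(coassoc))  # lists of item indices, one list per group
```

The co-association matrix is also a direct estimate of the quantity asked about in the question: entry (J, K) is the observed fraction of runs in which iJ and iK ended up together.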

The other option is to sort the data beforehand and use a strictly deterministic algorithm.
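
A minimal sketch of that second option, under stated assumptions: the data are sorted first, and the seeds are then chosen by a fixed rule (here, farthest-first traversal starting from the first sorted item) before running ordinary k-means. The function name, the seeding rule, and the toy data `items` are illustrative choices, not something prescribed above.

```python
import math
import random


def deterministic_kmeans(points, k, n_iter=20):
    """k-means whose seeds depend only on the data, not on its order.

    Seeds: sort the items, take the first one, then repeatedly add the item
    farthest from the seeds chosen so far (farthest-first traversal).
    """
    pts = sorted(points)
    centers = [pts[0]]
    while len(centers) < k:
        centers.append(max(pts, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center if a group is empty.
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return {frozenset(g) for g in groups if g}


# Hypothetical toy data: three well-separated groups of 2-D items.
items = [
    (0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.3),
    (10.0, 0.0), (10.1, 0.3), (9.7, 0.2),
]

reference = deterministic_kmeans(items, 3)

# The same items in a different order give the same partition,
# because the sort removes any dependence on input order.
shuffled = random.Random(1).sample(items, len(items))
print(deterministic_kmeans(shuffled, 3) == reference)
```

Sorting is what makes the tie-breaking reproducible: without it, two equally distant candidates could be picked in a different order depending on how the items happen to be listed.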

damienfrancois