
I am attempting to run a Monte Carlo simulation that compares two different clustering techniques. The following code generates a dataset according to random clustering and then applies two clustering methods (k-means and sparse k-means).

My issue is that the three labelings (the true clusters, k-means, and sparse k-means) use different labels for their clusters. For example, what I call cluster 1, k-means might call cluster 2 and sparse k-means might call cluster 3. When I regenerate the data and re-run, the labeling differences are not consistent: sometimes the labels agree, sometimes they do not.

Can anyone provide a way to 'standardize' these labels so I can run n iterations of the simulation without having to manually resolve labeling differences each time?

My code:

library(sparcl)
library(flexclust)

# Generate n observations in p dimensions. Cluster labels are drawn
# uniformly from {1, 2, 3}; the first q features are informative:
# cluster 1 is shifted to mean mu, cluster 2 to mean -mu, and cluster 3
# keeps mean 0. All features have unit variance.
x.generate = function(n, p, q, mu) {
  c = sample(c(1, 2, 3), n, replace = TRUE)
  x = matrix(rnorm(p * n), nrow = n)
  for (i in 1:n) {
    if (c[i] == 1) {
      for (j in 1:q) {
        x[i, j] = rnorm(1, mu, 1)
      }
    }
    if (c[i] == 2) {
      for (j in 1:q) {
        x[i, j] = rnorm(1, -mu, 1)
      }
    }
  }
  return(list('sample' = x, 'clusters' = c))
}

x = x.generate(20, 50, 50, 1)
# Tune the L1 bound for sparse k-means, then fit both methods with K = 3.
w = KMeansSparseCluster.permute(x$sample, K = 3, silent = TRUE)
kms.out = KMeansSparseCluster(x$sample, K = 3, wbounds = w$bestw, silent = TRUE)
km.out = kmeans(x$sample, 3)
# Cross-tabulate each clustering against the truth; randIndex() is
# label-invariant, so CER is unaffected by how the clusters are numbered.
tabs = table(x$clusters, kms.out$Cs)
tab = table(x$clusters, km.out$cluster)
CER = 1 - randIndex(tab)

Sample output of x$clusters, km.out$cluster, and kms.out$Cs:

> x$clusters 
 [1] 3 2 2 2 1 1 2 2 3 2 1 1 3 1 1 3 2 2 3 1 

> km.out$cluster 
 [1] 3 1 1 1 2 2 1 1 3 1 2 2 3 2 2 3 1 1 3 2 

> kms.out$Cs 
 [1] 1 2 2 2 3 3 2 2 1 2 3 3 1 3 3 1 2 2 1 3 
    Keep in mind that `kmeans` (and from the looks of it, `KMeansSparseCluster` as well) is inherently random: it starts the algorithm with a random choice of centers. So expecting consistent output without specifying the starting points each time is probably unreasonable. – joran Nov 07 '13 at 18:28
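
A minimal sketch of that suggestion (my own example, not joran's code): fix the RNG seed, or pass an explicit matrix of starting centers, so that each run is at least reproducible. Note that this still does not align labels between the two methods.

set.seed(42)                          # make the random initialization repeatable
km.out <- kmeans(x$sample, centers = 3, nstart = 25)

# or remove the randomness entirely by fixing the starting centers:
km.fixed <- kmeans(x$sample, centers = x$sample[1:3, ])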

4 Answers


One of the most widely used criteria of similarity is the Jaccard distance. See for instance Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing (pp. 6-17).

Others include the (adjusted) Rand index, which your code already computes via randIndex, and the Variation of Information distance discussed in another answer below.
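
As an illustration (my own base-R sketch, not part of this answer), the pair-counting Jaccard index between two label vectors can be computed as follows; it equals 1 when the two clusterings group exactly the same pairs of points, regardless of how the clusters are numbered:

## Pair-counting Jaccard index between two clusterings a and b:
## n11 = pairs grouped together by both, n10 / n01 = by only one of them.
jaccard.index <- function(a, b) {
  pairs  <- lower.tri(diag(length(a)))   # index each distinct pair once
  same.a <- outer(a, a, "==")[pairs]
  same.b <- outer(b, b, "==")[pairs]
  n11 <- sum(same.a & same.b)
  n10 <- sum(same.a & !same.b)
  n01 <- sum(!same.a & same.b)
  n11 / (n11 + n10 + n01)
}

jaccard.index(x$clusters, km.out$cluster)  # label-invariant by construction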

– damienfrancois
0

As @joran points out, the cluster labels are nominal and thus have no inherent order.

Here are two heuristics that come to mind:

  • Starting from the tables you already calculate: when the clusters are well aligned, the trace of the tab matrix is maximal.
    If the number of clusters is small, you can find the maximum by trying all n! permutations of the labels of method 2 against the n clusters of method 1 (see the sketch after this list). If that is too expensive, use a greedy heuristic that first puts the biggest match onto the diagonal, and so on.

  • Similarly, the trace of the distance matrix between the centroids of the two methods should be minimal.
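
A minimal sketch of the first heuristic (the helper names here are mine): brute-force all label permutations of the second clustering and keep the one that maximizes the trace of the confusion table. This is feasible for small k (6 permutations for k = 3):

## All permutations of a vector, built recursively (fine for small k).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (p in permutations(v[-i]))
      out <- c(out, list(c(v[i], p)))
  out
}

## Relabel clustering b so it agrees with clustering a as much as possible.
match.labels <- function(a, b, k = max(a, b)) {
  best <- NULL; best.trace <- -Inf
  for (p in permutations(1:k)) {
    relabeled <- p[b]    # apply the label permutation p to b
    tr <- sum(diag(table(factor(a, 1:k), factor(relabeled, 1:k))))
    if (tr > best.trace) { best.trace <- tr; best <- relabeled }
  }
  best
}

km.aligned <- match.labels(x$clusters, km.out$cluster)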

– cbeleites unhappy with SX

K-means is a randomized algorithm, so you must expect the cluster labels to be randomly ordered.

That is why the established evaluation methods for clusterings (read the Wikipedia article on clustering, in particular the section on "external validation") do not assume that there is a one-to-one mapping of clusters.

Even worse, one clustering algorithm may find 3 clusters, another one may find 4 clusters.
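
For instance, with the flexclust package already used in the question, the adjusted Rand index is computed from the cross-table of the two labelings, so no label correspondence is needed (my own toy example):

library(flexclust)

# Two labelings of the same six objects: the grouping is identical and
# only the label numbers differ, so the adjusted Rand index is 1.
a <- c(1, 1, 2, 2, 3, 3)
b <- c(2, 2, 3, 3, 1, 1)
randIndex(table(a, b))

# The cross-table need not be square, so clusterings with different
# numbers of clusters can still be compared.
randIndex(table(a, c(2, 2, 3, 3, 1, 4)))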

There are also hierarchical clustering algorithms, where each object can belong to many clusters because clusters can be nested within each other.

Also, some algorithms such as DBSCAN have a notion of "noise": objects that do not belong to any cluster.

– Has QUIT--Anony-Mousse

I would not recommend the Jaccard distance (even though it is famous and well established), as it is strongly influenced by cluster sizes; this is because it counts node pairs rather than nodes. I also find the methods with a statistical flavour to miss the point. The point is that the space of partitions (clusterings) has a beautiful lattice structure. Two distances that work beautifully within that structure are the Variation of Information (VI) distance and the split/join distance. See also this answer on stackexchange:

https://stats.stackexchange.com/questions/24961/comparing-clusterings-rand-index-vs-variation-of-information/25001#25001

It includes examples of all three distances discussed here (Jaccard, VI, split/join).
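
As a self-contained illustration (the helper below is my own sketch, not code from the linked post), VI can be computed directly from the joint distribution of the two labelings as H(A) + H(B) - 2 I(A; B):

## Variation of Information between two clusterings a and b.
vi.dist <- function(a, b) {
  p  <- table(a, b) / length(a)        # joint distribution of the labels
  pa <- rowSums(p)                     # marginal distribution of a
  pb <- colSums(p)                     # marginal distribution of b
  H  <- function(q) -sum(q[q > 0] * log(q[q > 0]))             # entropy
  mi <- sum(p[p > 0] * log(p[p > 0] / outer(pa, pb)[p > 0]))   # mutual information
  H(pa) + H(pb) - 2 * mi               # VI = H(A) + H(B) - 2 I(A; B)
}

vi.dist(x$clusters, km.out$cluster)    # 0 means the partitions are identical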

– micans