I am trying to run a Monte Carlo simulation that compares two clustering techniques. The following code generates a dataset with randomly assigned cluster memberships and then applies both techniques (k-means and sparse k-means).
My issue is that the true labels and the two techniques each number their clusters differently. For example, what I call cluster 1, k-means might call cluster 2 and sparse k-means might call cluster 3. When I regenerate the data and re-run, the mapping between labelings is not consistent: sometimes the labels agree, sometimes they do not.
Can anyone provide a way to 'standardize' these labels so I can run n iterations of the simulation without having to manually resolve labeling differences each time?
My code:
library(sparcl)
library(flexclust)
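# Generate n observations with p features; the first q features carry the cluster signal
x.generate=function(n,p,q,mu){
  # True cluster labels, drawn uniformly from 1, 2, 3
  c=sample(c(1,2,3),n,replace=TRUE)
  x=matrix(rnorm(p*n),nrow=n)
  for(i in 1:n){
    # Cluster 1: the first q features have mean mu
    if(c[i]==1){
      for(j in 1:q){
        x[i,j]=rnorm(1,mu,1)
      }
    }
    # Cluster 2: the first q features have mean -mu
    if(c[i]==2){
      for(j in 1:q){
        x[i,j]=rnorm(1,-mu,1)
      }
    }
    # Cluster 3: all features keep mean 0
  }
  return(list('sample'=x,'clusters'=c))
}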
x=x.generate(20,50,50,1)
w=KMeansSparseCluster.permute(x$sample,K=3,silent=TRUE)
kms.out = KMeansSparseCluster(x$sample,K=3,wbounds=w$bestw,silent=TRUE)
km.out = kmeans(x$sample,3)
tabs=table(x$clusters,kms.out$Cs)
tab=table(x$clusters,km.out$cluster)
CER=1-randIndex(tab)
Sample output of x$clusters, km.out$cluster, and kms.out$Cs:
> x$clusters
[1] 3 2 2 2 1 1 2 2 3 2 1 1 3 1 1 3 2 2 3 1
> km.out$cluster
[1] 3 1 1 1 2 2 1 1 3 1 2 2 3 2 2 3 1 1 3 2
> kms.out$Cs
[1] 1 2 2 2 3 3 2 2 1 2 3 3 1 3 3 1 2 2 1 3
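To make the request concrete, here is a rough sketch of the kind of relabeling I have in mind. It is not taken from sparcl or flexclust: `all_perms` and `align_labels` are hypothetical helper names of my own. It assumes the labels are the integers 1..K with K small enough that trying every permutation is cheap. It picks the permutation of predicted labels that maximizes agreement with the true labels in the contingency table. The Rand index itself does not depend on how clusters are numbered, so this should only matter when comparing label vectors directly.

```r
# Hypothetical helper: list every permutation of the vector v (fine for small K)
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Hypothetical helper: relabel `pred` so it agrees with `truth` as much as possible
align_labels <- function(truth, pred, K = 3) {
  # K x K contingency table; factor() keeps empty clusters as zero rows/columns
  tab <- table(factor(truth, levels = 1:K), factor(pred, levels = 1:K))
  best <- NULL
  best_agree <- -1
  for (perm in all_perms(1:K)) {
    # perm[k] is the predicted label matched to true label k
    agree <- sum(tab[cbind(1:K, perm)])
    if (agree > best_agree) {
      best_agree <- agree
      best <- perm
    }
  }
  # Map each predicted label back to the true label it was matched with
  match(pred, best)
}

# The label vectors from the sample output above
truth <- c(3, 2, 2, 2, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 1, 3, 2, 2, 3, 1)
km    <- c(3, 1, 1, 1, 2, 2, 1, 1, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1, 3, 2)
kms   <- c(1, 2, 2, 2, 3, 3, 2, 2, 1, 2, 3, 3, 1, 3, 3, 1, 2, 2, 1, 3)

km_aligned  <- align_labels(truth, km)
kms_aligned <- align_labels(truth, kms)
```

In the simulation this would be called as `align_labels(x$clusters, km.out$cluster)` and `align_labels(x$clusters, kms.out$Cs)`. For larger K, brute force over permutations gets expensive, and an assignment-problem solver would probably be the better choice.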