Basically, I simulate thousands of data sets and then cluster them with different clustering techniques, such as k-means and model-based clustering.

Then I validate the performance of the methods using the correct classification rate (CCR). However, I run into the label switching problem and thus can't get a realistic CCR. So my question is: is there a way to unify all the labels in R for multivariate data sets?

Here is a simple example:

  # Create the random data sets (three well-separated univariate clusters):
  data1 <- rnorm(5, 0, 0.5) # cluster 1
  data2 <- rnorm(5, 2, 0.5) # cluster 2
  data3 <- rnorm(5, 4, 0.5) # cluster 3
  alldata <- c(data1, data2, data3)

  # Cluster the data using different methods:
  library(cluster) # provides pam()

  km.method <- kmeans(alldata, centers = 3)$cluster
  # [1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2

  pam.method <- pam(alldata, 3)$clustering
  # [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

  # As you can see, the partitions are exactly the same, but the labels differ!
  # How can I unify the labels across all methods so they match the true labels?
meme
  • Not to be mean, but your question doesn't make sense. If you could "unify them to match true labels", why would you use clustering? Clustering is not classification by itself. If you know that the *ground-truth* grouping looks something like `rep(1L:3L, each=5L)`, you could use *cluster validity indices* to evaluate performance. Function `comPart` in the `flexclust` package provides some indices. – Alexis Jun 21 '18 at 21:50
  • @Alexis Thank you. Yes, I know about the ARI, which I can use without worrying about how the labels are assigned in the results. However, this time I need the CCR in addition to the ARI. Regarding classification vs. clustering, you are right, but as I mentioned above this is simulation work: I created the groups myself and am testing different clustering techniques proposed in some papers. The second stage will be applying these clustering methods to real-world data, where I don't know the true groups. So my question relates only to the simulation part. – meme Jun 26 '18 at 10:59

1 Answer

CCR is not an appropriate measure for clustering.

Because clustering algorithms do not output class labels, the CCR is, strictly speaking, 0 by definition.

Consider the Iris data set. The correct classes are the species. A clustering algorithm such as k-means produces arbitrary "labels" such as 0, 1, 2. None of these is a species name, so none of them is correct.
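To make the Iris point concrete, here is a minimal sketch using only base R. The seed, the `nstart` value, and the variable names `km`, `tab`, and `naive_ccr` are illustrative assumptions, not something from the question.

```r
# Cluster the four numeric iris measurements into three groups.
set.seed(42)  # arbitrary seed, only so the run is repeatable
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Cross-tabulate the true species against the cluster IDs.
# Which integer ends up attached to which species is arbitrary.
tab <- table(species = iris$Species, cluster = km$cluster)
print(tab)

# Comparing the raw outputs directly is meaningless:
# an integer cluster ID never equals a species name.
naive_ccr <- mean(as.character(km$cluster) == as.character(iris$Species))
print(naive_ccr)
```

The cross-tabulation shows how the groups overlap, while the literal label comparison carries no information about clustering quality.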

The proper way to evaluate a clustering is to use a cluster evaluation measure, such as the adjusted Rand index or normalized mutual information. These measure how well the groups overlap, not whether the individual labels agree.
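As a rough illustration of label invariance, the sketch below computes the adjusted Rand index from a contingency table in base R. The helper name `adjusted_rand` is hypothetical (packages such as `mclust` and `flexclust` ship their own implementations), and the label vectors are copied from the output shown in the question.

```r
# Adjusted Rand index from the contingency table of two partitions.
# Hypothetical helper written for illustration only.
adjusted_rand <- function(x, y) {
  tab <- table(x, y)
  sum_ij <- sum(choose(tab, 2))            # pairs grouped together in both
  sum_a  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in x
  sum_b  <- sum(choose(colSums(tab), 2))   # pairs grouped together in y
  expected  <- sum_a * sum_b / choose(length(x), 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

truth      <- rep(1:3, each = 5)
km.labels  <- c(3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
pam.labels <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)

ari_km  <- adjusted_rand(km.labels, truth)
ari_pam <- adjusted_rand(pam.labels, truth)
print(c(kmeans = ari_km, pam = ari_pam))
```

Both methods get the same score even though their label numbering differs, because only the grouping of the points enters the calculation.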

Has QUIT--Anony-Mousse