
I am using the fpc package in R to perform cluster validation.

I can use the function cluster.stats() to compare my clustering with an external partitioning and compute several metrics such as the Rand index, entropy, etc.

However, I am looking for a metric called 'purity' or 'cluster accuracy', which is defined at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

I am wondering if there is an implementation of this measure in R.

Thanks, Chet

Amro
chet

1 Answer


I don't know of an off-the-shelf function, but here is one way you could do it yourself using the equation in your link:

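ClusterPurity <- function(clusters, classes) {
  # Contingency table: rows are the true classes, columns are the clusters.
  # Take the majority-class count in each cluster, sum, and normalize by N.
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}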
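For reference, this is the equation from the linked page as I read it, written out so the mapping to the code is visible. The symbols are the ones that page uses: $\Omega = \{\omega_1, \dots, \omega_K\}$ is the set of clusters, $C = \{c_1, \dots, c_J\}$ is the set of classes, and $N$ is the number of items.

```latex
\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \, \lvert \omega_k \cap c_j \rvert
```

In the function, table(classes, clusters) holds the counts $\lvert \omega_k \cap c_j \rvert$, apply(..., 2, max) takes the maximum over classes within each cluster column, and length(clusters) plays the role of $N$.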

Here we can test it on some random assignments, where I believe we expect the purity to be roughly 1/number-of-classes (slightly above, since each cluster takes the maximum of noisy counts):

> n = 1e6
> classes = sample(3, n, replace=TRUE)
> clusters = sample(5, n, replace=TRUE)
> ClusterPurity(clusters, classes)
[1] 0.334349
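The random test only checks the baseline, so here is a small example that can be verified by hand. It is a sketch with made-up toy labels (the vectors below are hypothetical, not from the question), and the function is repeated so the block runs on its own.

```r
# Same definition as above, repeated so this block is self-contained.
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

# Hypothetical toy labels: 6 items, 2 clusters, 2 true classes.
clusters <- c(1, 1, 1, 2, 2, 2)
classes  <- c("a", "a", "b", "b", "b", "a")

# Rows are classes, columns are clusters.
tab <- table(classes, clusters)
print(tab)

# Cluster 1 majority is "a" (2 of 3); cluster 2 majority is "b" (2 of 3).
purity <- ClusterPurity(clusters, classes)
print(purity)  # (2 + 2) / 6
```

Each cluster contributes the count of its most common class, so the result is (2 + 2) / 6, about 0.667.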
John Colby
  • That was short and easy! I use R quite infrequently and was beginning to write a long function to do this. Thanks so much for saving me time and teaching me one more thing in R. – chet Feb 16 '12 at 15:49
  • I want to do the same for a gene expression matrix where the rows are sample names and the columns are genes. How can I apply your function? I will get clusters assigned to the data frame, but what about classes? Can you show me a dummy example? – kcm Oct 04 '19 at 10:57
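
A possible dummy example for the samples-by-features case, offered as a sketch rather than a definitive recipe. The classes have to come from outside the clustering, for example a known annotation per sample such as tissue type or condition, given in the same row order as the matrix. Here the built-in iris data stands in for an expression matrix (rows as samples, columns as features) and Species stands in for the external labels. The use of kmeans and the seed are assumptions made only for illustration.

```r
# Same definition as in the answer, repeated so this block is self-contained.
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

# Stand-in for an expression matrix: rows are samples, columns are features.
expr <- as.matrix(iris[, 1:4])

# External labels, one per row of expr, in the same order.
classes <- iris$Species

# Any clustering that returns one label per row would do; kmeans is an assumption.
set.seed(1)
clusters <- kmeans(expr, centers = 3)$cluster

purity <- ClusterPurity(clusters, classes)
print(purity)
```

If there is no external annotation for the samples, purity cannot be computed, because it measures agreement with known classes.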