
I'm analyzing data in R where predictor variables are available but there is no response variable. Using unsupervised learning (k-means), I have identified patterns in the data. Now I need to rank the clusters by their overall performance (for example, student data with exam marks and co-curricular marks). What technique can I use in R after clustering?

njp
lingezh

1 Answer


The cluster component of the kmeans output gives, for each data point, the index of the cluster it was assigned to. Example data taken from the kmeans documentation:

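nclusters <- 5
# a 2-dimensional example: 50 points around (0, 0) and 50 around (1, 1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# nstart = 25 runs k-means from 25 random starts and keeps the best fit
cl <- kmeans(x, nclusters, nstart = 25)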

Now, your evaluation function (e.g. mean of column values) can be applied to each cluster individually:

for (i in seq_len(nclusters)) {
    # drop = FALSE keeps a matrix even if a cluster has a single point
    cat(i, colMeans(x[cl$cluster == i, , drop = FALSE]), "\n")
}

Or better still, use an aggregation function such as tapply or aggregate, e.g.:

aggregate(x, by=list(cluster=cl$cluster), FUN=mean)

which gives (exact values vary from run to run because the data are random)

  cluster          x          y
1       1  1.2468266  1.1499059
2       2 -0.2787117  0.0958023
3       3  0.5360855  1.0217910
4       4  1.0997776  0.7175210
5       5  0.2472313 -0.1193551

At this point you can rank the clusters by these aggregated values as needed, for example with order() or rank().
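As a rough sketch of that last ranking step: the code below standardises the columns, averages them within each cluster, and ranks the clusters by a composite score. The composite score (an unweighted mean of the standardised columns, higher meaning better) and the fixed seed are assumptions made for illustration. What counts as "overall performance" is a judgement about your data, so substitute your own weights or scoring function.

```r
# Sketch: rank k-means clusters by a composite score.
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same kind of 2-dimensional example data as above
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
nclusters <- 5
cl <- kmeans(x, nclusters, nstart = 25)

# Standardise columns so each variable contributes on the same scale
z <- scale(x)

# Per-cluster means of the standardised variables
centers <- aggregate(z, by = list(cluster = cl$cluster), FUN = mean)

# Assumption: overall performance = unweighted mean of the standardised columns
centers$score <- rowMeans(centers[, c("x", "y")])

# Rank 1 = best-performing cluster (highest score)
centers$rank <- rank(-centers$score, ties.method = "min")

# Clusters ordered from best to worst
ranked <- centers[order(centers$rank), ]
print(ranked)
```

Standardising first matters when the variables are on different scales (say, exam marks out of 100 and co-curricular marks out of 10); otherwise the larger-scale variable would dominate the score. If some variables should count more than others, a weighted sum could replace rowMeans.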

njp