I'm analyzing a dataset in R that has predictor variables but no response variable. Using unsupervised learning (k-means), I have identified patterns in the data, but I need to rank the clusters according to their overall performance (for example, students' data on exam marks and co-curricular marks). What technique can I use after clustering in R?
1 Answer
The cluster component of the kmeans output gives the index of the cluster each data point belongs to. Example data taken from the kmeans documentation:
nclusters <- 5
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, nclusters, nstart = 25)
Now, your evaluation function (e.g. mean of column values) can be applied to each cluster individually:
for (i in seq_len(nclusters)) {
  # drop = FALSE keeps a matrix even if a cluster contains a single point
  cat(i, colMeans(x[cl$cluster == i, , drop = FALSE]), "\n")
}
Or better still, use an aggregation function such as tapply or aggregate, e.g.:
aggregate(x, by=list(cluster=cl$cluster), FUN=mean)
which gives (exact values vary between runs, since the data are random):
  cluster          x          y
1       1  1.2468266  1.1499059
2       2 -0.2787117  0.0958023
3       3  0.5360855  1.0217910
4       4  1.0997776  0.7175210
5       5  0.2472313 -0.1193551
At this point you should be able to rank the values of the aggregation function as needed.
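As a rough sketch of that last ranking step: the code below standardises the columns, averages them per cluster with aggregate, and ranks the clusters on an equal-weight composite score. The scoring rule (mean of standardised columns), the fixed seed, and the names z, centers, score and ranked are assumptions made for illustration, not part of the original answer; for real data you would likely choose weights that reflect how much each variable matters.

```r
# Self-contained sketch: cluster, summarise per cluster, then rank the clusters.
set.seed(42)  # assumption: fixed seed so the run is reproducible
nclusters <- 5
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, nclusters, nstart = 25)

# Standardise each column so variables on different scales are comparable
z <- as.data.frame(scale(x))

# Per-cluster means of the standardised columns
centers <- aggregate(z, by = list(cluster = cl$cluster), FUN = mean)

# Hypothetical composite score: equal-weight average of the column means
centers$score <- rowMeans(centers[, c("x", "y")])

# Rank 1 = highest score
centers$rank <- rank(-centers$score, ties.method = "min")
ranked <- centers[order(centers$rank), ]
print(ranked)
```

If higher values are not "better" for every column, flip the sign of those columns before averaging.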

njp