I read that thread, but in it the answer says to use the column means for cluster i (so if I have 3 clusters in total, i = 1, 2, 3; and if I have 2 clusters in total, i = 1, 2). I will copy and paste their answer here:
clusters = cutree(hclust(dist(USArrests)), k=5) # get 5 clusters
# function to find the centroid (column means) of cluster i
clust.centroid = function(i, dat, clusters) {
  ind = (clusters == i)
  colMeans(dat[ind, ])
}
sapply(unique(clusters), clust.centroid, USArrests, clusters)
[,1] [,2] [,3] [,4] [,5]
Murder 11.47143 8.214286 5.59 14.2 2.95
Assault 263.50000 173.285714 112.40 336.0 62.70
UrbanPop 69.14286 70.642857 65.60 62.5 53.90
Rape 29.00000 22.842857 17.27 24.0 11.51
But that does not make sense to me! If I have a data set with 3 variables/columns and I only want 2 clusters, then with their method only the column means for columns 1 and 2 are used, and the column mean for the 3rd column will never be calculated!
Let's say I create such a data table:
a = c(1,2,3,4,2,2,5,3,1)
b = c(4,5,2,2,1,1,1,1,3)
c = c(1,1,1,0,0,0,0,0,1)
abc = data.frame(a=a, b=b, c=c)
str(abc)
And the last line prints the following structure:
'data.frame': 9 obs. of 3 variables:
$ a: num 1 2 3 4 2 2 5 3 1
$ b: num 4 5 2 2 1 1 1 1 3
$ c: num 1 1 1 0 0 0 0 0 1
I then scale the data:
abc_scaled = scale(abc)
Then I calculate the distances, create the hierarchical clustering, and cut the tree:
distance = dist(abc_scaled, method="euclidean")
hcluster = hclust(distance, method="ward.D")
clusters = cutree(hcluster, h = (max(hcluster$height) - 0.1))
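To check how many clusters that cut actually produced (the height-based cut does not guarantee a particular number), I can inspect the result like this:

```r
# one cluster ID per row of abc; table() gives the size of each cluster
clusters
table(clusters)
```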
Let's say I get 2 clusters as a result. How can I compare the centroids of the 2 clusters? And how can I add labels to the clusters?
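For concreteness, here is how I would try to apply the answer's clust.centroid function to my own scaled data, assuming that is the intended usage (the function is repeated here so this snippet runs on its own after the code above):

```r
# centroid (column means) of cluster i, as in the answer above
clust.centroid = function(i, dat, clusters) {
  ind = (clusters == i)
  colMeans(dat[ind, ])
}

# one column of centroid coordinates per cluster, one row per variable (a, b, c)
sapply(unique(clusters), clust.centroid, abc_scaled, clusters)
```

Is this the right way to compare the centroids, and if so, how do I attach meaningful labels to the resulting clusters?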